BACKGROUND

While ambulatory diagnostic errors are increasingly recognized as an important quality and safety issue worldwide,1,2 diagnostic safety suffers from a dearth of quality measures that are both useful and actually used to catalyze improvement.3 To make meaningful progress, organizations and clinicians need practical, effective, and sustainable tools that can engage clinicians and patients in closing care and communication gaps across a myriad of interrelated areas.4 Akin to an ophthalmoscopic examination, clinical note documentation can provide a window into the quality of the diagnostic process. Notes are not only a recorded representation of the clinician’s diagnostic activities and thinking, but also serve as important communication tools for ensuing visits, other clinicians, and, increasingly, patients accessing their medical records via Open Notes.5 Particularly where there is diagnostic uncertainty, clear and thorough documentation is important for transparently capturing clinicians’ assessments and guiding next steps.6

Clinical notes remain a widely voiced challenge, particularly regarding note quality and the time burden of writing them. Many have expressed concern that note quality is worsening with the heavy use of copy/pasted and templated notes.5,6,7 Of particular concern are shortcomings in the clear and accurate conveyance of diagnostic thinking and assessment. For example, notes in diagnostic error malpractice cases often lack meaningful assessments, and in many cases lack any differential diagnosis or evidence of meaningful thinking about the patient’s problem.8 In many notes, templated check boxes have displaced richer narrative assessments.9,10 Understanding note quality, both descriptively and prescriptively, in order to improve diagnostic assessment is thus an untapped frontier. As patients increasingly gain access to their notes, the diagnostic content of notes assumes added importance.11,12,13 The lack of standardization, shared best practices, streamlined workflows, and shared “assessment” language is so pervasive that virtually every physician has a unique style of note writing. This variation can be viewed either as another electronic health record (EHR) area in need of more regimentation, or as an opportunity for learning.14,15 This paper takes the latter view and presents the results of a qualitative content analysis of urgent care notes to advance our understanding of how well these notes convey differential diagnoses and address contextual elements, diagnostic uncertainties, and contingencies. In addition, since studies highlight the importance of engaging patients through Open Notes,16,17,18 this study also aimed to assess the readability of notes for both patients and clinicians.

METHODS

Study Purpose and Design

Objective

The purpose of this study was twofold: (a) to develop a tool for assessing the quality of diagnostic assessments in urgent care settings and (b) to pilot-test and validate this chart assessment tool on a sample of 100 charts.

Sample

Using convenience sampling, we recruited 5 primary care physicians (PCPs) (3 female) who see primarily urgent care patients in an academic primary care clinic and who had at least 20 urgent care encounters during the study period. PCPs were invited into the study via email. Physician informed consent was obtained at the initiation of the study by the research assistant (JR). A total of 100 charts of urgent care encounters by the 5 selected PCPs were chosen for analysis via a random number generator.
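Purely as an illustration of this selection step (the study does not describe its software; the file name and field name below are hypothetical), the random draw could look like the following Python sketch:

    import csv
    import random

    # Hypothetical export: one row per eligible urgent care encounter for the 5 PCPs.
    with open("eligible_encounters.csv", newline="") as f:
        encounter_ids = [row["encounter_id"] for row in csv.DictReader(f)]

    random.seed(2018)  # fixed seed so the draw can be reproduced
    sampled_charts = random.sample(encounter_ids, k=100)  # 100 charts, drawn without replacement

Any comparable random number generator would serve the same purpose; the point is simply that charts were drawn at random from the pool of eligible encounters.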

Study Design

We conducted a retrospective chart review of 100 urgent care encounters that occurred during January–December 2018 at Brigham and Women’s Hospital, drawn from the 5 PCPs recruited into the study. The review was conducted in three stages. First, the initial tool (see below for description) was tested on a sample of 25 urgent care charts by 6 reviewers who met biweekly for 3 months (two PCPs, a medical student, a research assistant, a communication expert and qualitative researcher, and a health services researcher). In the second stage, three team members (the research assistant, the medical student, and the qualitative research PhD) individually scored 16 training charts taken from the final list of 100 urgent care charts and met 3 times to resolve disagreements. They consulted the principal investigator to resolve scoring discrepancies and reached 100% agreement. Lastly, one team member (the 4th-year medical student) scored the 100 charts using the coding manual and calibrated Likert scales we developed.

Measure Development and Validation

Based on a literature review,2,19,20,21 existing EHR evaluation tools, malpractice insurers’ recommendations, and best practices for writing clinical notes, we developed the Assessing the Assessment (ATA) tool to examine the quality of documented diagnostic assessments (Table 1). The tool was aimed at understanding current documentation practices and developing guidelines for optimizing notes’ diagnostic assessments, including their organization, content, and readability. It examined a series of diagnostic assessment elements (e.g., addressing the chief concern, differential diagnosis, uncertainty, red flags, time frames, contingencies), in addition to evaluating pejorative language, note organization/succinctness, and readability level for clinicians and patients. Although in our past work we have developed a tool to assess errors and shortcomings in history-taking and physical examination,22,23 here we focused primarily on the recorded diagnostic assessments, given that they are the most tangible representation of the clinician’s diagnostic thinking and decision-making.24

Table 1 “Assessment of the Assessment” Tool

We examined components of the tool’s construct validity, including content validity, response process, internal structure, relations to other variables, and consequences.25 Content validity was good, given the sources used to generate the items. The response process was straightforward and confirmed by experts in the field, and the scoring process was carefully documented, with all raters keeping track of their comments and questions. Internal consistency/reliability was satisfactory, with a reasonable Cronbach’s alpha for internal consistency and kappa for inter-rater reliability (see below). Consequence validity was supported by the tool’s favorable ability to categorize charts by diagnostic quality. We could not assess relations to other variables (external validity), since the ATA is a unique and novel instrument.

The study was approved by the Institutional Review Board as a medical records study.

Analysis

Data were analyzed using content analysis26,27 to assess the diagnostic quality of 100 charts by applying the key concepts of the ATA tool, which contained 4 sections (Diagnostic assessment, Follow-up plan, Situational awareness, and General subjective assessment), each with several sub-items. We iteratively refined the tool based on the pilot reviews, including adding several new sub-items to the diagnostic domains (e.g., readability for the patient and clinician, and use of inappropriate copy/paste and templating). Based on team discussions during the first stage of the study, we refined operational definitions for each of the items and scores. For example, we operationally defined our patient readability item as examining the presence (or absence) of medical jargon or abbreviations the patient might not understand, coherence of the prose, conformance to an 8th-grade reading level, and general formatting.
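In the study, readability was judged by human reviewers. Purely as a hedged sketch of how the 8th-grade reading-level criterion could be screened automatically, using the third-party textstat package (an assumption, not part of the study’s methods), one could write:

    import textstat  # pip install textstat

    def screen_patient_readability(note_text: str, grade_ceiling: float = 8.0) -> dict:
        """Rough automated screen for one of the manual readability criteria described above."""
        grade = textstat.flesch_kincaid_grade(note_text)  # estimated US school grade level
        return {"flesch_kincaid_grade": grade, "above_8th_grade": grade > grade_ceiling}

    # Example: heavily abbreviated prose that a patient may find hard to read.
    print(screen_patient_readability("Pt c/o SOB x3d. DDx: URI vs. bronchitis. RTC PRN."))

Such a screen covers only the reading-level criterion; jargon, abbreviations, coherence, and formatting still require human judgment, as in our rating process.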

Items were scored on a 7-point scale (1, absent; 2, minimal; 3, less than good; 4, neutral; 5, good; 6, very good; 7, excellent) to capture the granularity of the content. We considered an item present if it appeared anywhere in the chart; however, we gave a higher ranking to elements present in the assessment plan. We used a significance level of 0.05 for all statistical tests.

RESULTS

The overall kappa for inter-rater reliability during the review of the 16 training charts was 0.63. The highest-scored individual item was “Avoiding legal liability pejorative red flags” (mean 7.0, SD [0.0]). Noting/addressing potential diagnostic pitfalls received the lowest scores (mean 1.5, SD [1.4]). The General diagnostic assessment domain received a slightly higher score (mean 3.4, SD [1.1]) than the Follow-up plan (mean 2.7, SD [1.8]) and Situational awareness (mean 2.7, SD [1.1]) domains. The Other diagnostic factors domain, containing a variety of descriptive elements, was the highest-scored domain (mean 5.7, SD [1.2]). In general, there was variability in the quality of the individual elements comprising the four domains (Table 2). Within the General diagnostic assessment domain, Psychosocial information was absent in 82% of the charts, with a mean score of 1.6 (SD [1.4]). In contrast, a listing of possible etiologies was missing in only 13% of the charts, with a mean score of 4.1 (SD [1.5]). Figure 1 shows the percentage of missing ATA elements in the charts reviewed.
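To indicate how such summary statistics can be reproduced in principle (a sketch only; the input file, column names, and choice of weighted kappa are assumptions, since the study does not describe its software), one could compute per-item means/SDs and inter-rater agreement as follows:

    import pandas as pd
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical long-format ratings: one row per chart x ATA item,
    # with 1-7 scores from two raters on the 16 training charts.
    ratings = pd.read_csv("training_chart_ratings.csv")

    # Per-item mean and SD, analogous to the values reported in Table 2.
    item_summary = ratings.groupby("item")["rater_a"].agg(["mean", "std"]).round(1)

    # Inter-rater agreement; quadratic weights suit an ordinal 1-7 scale,
    # though the kappa variant used in the study is not specified here.
    kappa = cohen_kappa_score(ratings["rater_a"], ratings["rater_b"], weights="quadratic")
    print(item_summary, f"kappa = {kappa:.2f}", sep="\n")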

Table 2 Quality of Diagnostic “Assessment of the Assessment” Elements in the Chart: Results of Review of 100 Charts
Fig. 1

ATA elements missing in the charts, by domain (%). The y-axis shows the percent of charts with a missing element; elements are grouped by domain.

Situational Awareness

We created a “situational awareness” domain based on key elements recommended by experts and by the National Academy of Medicine report on diagnosis. Mention of “Don’t miss diagnoses” was absent in 16% of the charts, with a mean score of 3.7 (SD [1.8]). Red flags, a related construct referring to key worrisome symptoms, were absent in only 13% of the charts, whereas pitfalls to avoid scored poorly, with 86% of the notes missing this information (mean 3.0, SD [1.4]) (Table 2).

Other Key Domains

Within the Follow-up plan domain, similar variability was observed. Noting of any planned diagnostic tests was missing in about a third of the charts (mean 3.5, SD [2.1]), while any mention of clinical time frames was missing from nearly three-quarters of the charts and received a mean score of 1.9 (SD [1.8]).

The Other diagnostic factors domain contained items assessing important auxiliary aspects of chart writing, such as succinctness, clinician and patient readability, and avoidance of pejorative language. The patient readability element received the lowest score in this domain (mean 3.9, SD [0.9]), while avoiding pejorative language was the highest-scored element, with virtually no instances of clearly stigmatizing or pejorative terms found in our study sample (mean 7.0, SD [0]).

Finally, our clinical team made a subjective judgment about the overall quality of diagnosis, whether adequate tests were ordered, and whether providers avoided over-testing. The adequate tests and avoiding over-testing elements both received moderately high scores (mean 5.7, SD [1.3] and mean 6.5, SD [0.9], respectively) (Table 2).

A few exemplary high-quality charts were identified, but no trends could be derived from the small sample. Figure 2 illustrates the uneven distribution of the mean scores across the domains.

Fig. 2

Mean scores of the ATA elements. All elements of the ATA are included in the figure; the line represents the mean score for each element.

Internal Consistency

The overall internal consistency across the 19 elements of the ATA was acceptable28 (α = 0.76), although some subscales had less favorable consistency. Of the four subscales, Diagnostic assessment had favorable consistency (α = 0.71) and Situational awareness had borderline consistency (α = 0.63), while the other two subscales did not show evidence of internal consistency (Follow-up plan, α = 0.38; General subjective assessment, α = 0.29). There were clinically meaningful (r > 0.5) correlations between several of the subscales: the Diagnostic assessment domain was most strongly associated with Situational awareness (r = 0.60) and Follow-up plan (r = 0.57). The weakest association was observed between Follow-up plan and General subjective assessment (r = 0.12).
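As a worked illustration of these reliability statistics (not the authors’ code; the file, item, and subscale names below are hypothetical), Cronbach’s alpha can be computed directly from a charts-by-items score matrix, and the subscale correlations from subscale totals:

    import pandas as pd

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Cronbach's alpha for a charts-by-items matrix of 1-7 scores."""
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1).sum()
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    scores = pd.read_csv("ata_scores.csv")  # one row per chart, one column per ATA item

    subscales = {  # illustrative item groupings, not the actual ATA item names
        "diagnostic_assessment": ["chief_concern", "etiologies", "psychosocial"],
        "situational_awareness": ["dont_miss", "red_flags", "pitfalls"],
    }
    alphas = {name: cronbach_alpha(scores[cols]) for name, cols in subscales.items()}

    # Pearson correlations between subscale totals (e.g., the r = 0.60 reported above
    # for Diagnostic assessment vs. Situational awareness).
    totals = pd.DataFrame({name: scores[cols].sum(axis=1) for name, cols in subscales.items()})
    print(alphas, totals.corr(method="pearson"), sep="\n")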


DISCUSSION

We developed and deployed a novel tool that could be used by a medically trained researcher with reasonable reliability and with evidence of many aspects of construct validity in assessing the note elements that expert consensus suggests should be present in the documentation of the diagnostic assessment. The quality of the notes varied from clinician to clinician, from encounter to encounter, and, more notably, across the different domains of our Assessing the Assessment tool.

We found a somewhat consistent pattern in which the ATA elements with the lowest scores were absent from most of the charts (e.g., Psychosocial information and Pitfalls), suggesting clinicians’ assessments could benefit from improved documentation in these areas. We also observed a striking imbalance in the quality of documentation across the ATA elements (Fig. 2). While approaching good results in general writing and readability style, the notes lacked some of the clinical elements experts recommend for good assessment documentation.

As the ATA is a novel tool, the internal consistency of the overall instrument and of its individual subscales has room for improvement. Overall consistency was acceptable, but individual subscales will require further attention as the ATA metric evolves.

While we agree that busy clinicians cannot be expected to write (or dictate or have scribed) lengthy narrative assessments, nor is this desirable if the resulting note is not focused or succinct enough to be useful for subsequent readers, certain elements are important yet currently less well documented and possibly less well communicated to the patient. To the extent that templated check boxes have displaced more meaningful narratives, electronic clinical documentation has not fulfilled its potential role in improving clinicians’ notes. It is difficult to defend a diagnostic assessment that does not at least give a meaningful picture of what is going on with the patient, including some recognition of the uncertainties and probabilities inherent in primary care diagnoses. Thus, some comment on the likely diagnosis and its degree of certainty, as well as consideration of other alternatives, would seem a reasonable expectation. It would also be helpful to have a sense of the clinician’s assessment of factors (exposures, etiology, psychosocial) that may be contributing to or impacted by the medical diagnosis being communicated to the patient, something we found lacking in the majority of notes. Finally, although they were rarely present in the notes we reviewed, consideration of red flags, alternative “don’t miss” diagnoses, and various pitfalls (e.g., limitations of diagnostic test results) could sharpen clinicians’ thinking and enhance the quality of their notes, as well as provide medicolegal support in defending the rare but inevitable instances of missed or delayed diagnoses.

Some items may stand in conflict with others (e.g., a more succinct note could conflict with a more comprehensive one). However, depending on how this metric is used, we envision it could help provide a framework for better notes and give clinicians feedback on both the strengths and weaknesses of their own notes. While notes have been tied to billing and coding in numerous ways, expectations could be changed to allow more emphasis on the diagnostic reasoning that would better support the clinical care of the patient. What ultimately defines a good note? Ideally, determining this will need to be tied to various outcomes (patients’ health, satisfaction with their care and notes, and usability and usefulness for other clinicians).

How can busy and burnt-out clinicians be better supported in producing better diagnostic assessment documentation?29 It is our hope that the explicit framework embodied in our novel tool can help guide them to write better notes that incorporate these suggested features. With advances in voice recognition30 and the potential for interactive computerized prompts or scribe-facilitated documentation, we can envision ways to both streamline note writing and more reliably ensure higher quality notes. The tool could be useful educationally, to teach trainees how to write a good assessment, as well as for practicing clinicians to evaluate the quality of their own notes. Given the recent requirement by the 21st Century Cures Act that all notes be made readily available to patients, we envision and would like to help catalyze a transformative sea change in the use and value of this aspect of clinician-patient communication around diagnosis.18

LIMITATIONS

The assessment tool is based on best practice recommendations from the diagnostic safety and malpractice/risk management literature and from experts. We recognize that it has not been validated by correlating the ratings with what occurred in the exam room, with workloads during the visit, or with clinical outcomes. However, we chose relatively easily operationalizable measures that demonstrated good inter-rater reliability and had good face validity for items of importance in a recorded assessment. Whether higher-scored notes lead to better reader understanding of the clinician’s thinking, enhanced assessments, a greater likelihood of establishing the correct diagnosis with fewer errors or delays, or worthwhile returns on the added documentation time requires additional study.

We studied a limited number of physicians who practice in a primary care clinic at an academic medical center and who primarily see urgent care patients; this may impede generalization of our findings to more typical primary care encounters, which involve more of a mix of chronic disease management along with assessment of any new acute symptoms. Furthermore, while this pilot study suggests the ATA can be efficiently applied to evaluate the assessment features of clinical notes, there is a subjective element to the ratings, which may vary among raters. Nonetheless, we created and iteratively refined standardized operational definitions to anchor the Likert scales for each of the items and reached a reasonably good kappa agreement score on a sample of charts that were blindly reviewed by two reviewers.

Since we did not directly observe the actual encounters corresponding to these notes, we were not able to ascertain whether aspects such as red flags, psychosocial assessment, or time frames were in fact discussed during the visits rather than simply not documented in the chart. Nonetheless, the goal of this project was first to develop an understanding of how notes are written. Building on this initial project, we have conducted a paired evaluation of encounters and their corresponding charts, which is published elsewhere.20 Future work will also focus on the reliability of the two subscales with low alphas and attempt to further refine item inclusion to improve it.

CONCLUSION

We developed a tool to measure the presence and quality of key aspects of documented diagnostic assessments for patients presenting for evaluation of acute symptoms in a primary care clinic. We found significant variation in completeness and quality across different domains and items, suggesting opportunities for improvement. These findings assume added importance given the universal patient access to clinical notes mandated in the USA by the 21st Century Cures Act, coupled with growing interest in improving diagnostic thinking and processes to ensure more reliable diagnosis and higher-quality diagnostic communication with patients.31,32,33