Background

Polycystic ovary syndrome (PCOS) is the most common endocrine disorder in reproductive-age women. The diagnosis is based on its cardinal features, including irregular menstrual cycles, hyperandrogenism and polycystic ovary morphology, with two out of three features required for the diagnosis in the absence of other disorders causing the same symptoms [1, 2]. Additional features are variable, with obesity exacerbating hyperandrogenism and risk for metabolic disorders including impaired glucose tolerance, type 2 diabetes and metabolic syndrome [3–5]. The associated features depend on the diagnostic criteria employed [5–7], which differ depending on the specialty of the recruiting physician [8].

Based on the heterogeneous features and the need to rule out other diagnoses before PCOS is ascertained, it may be difficult to use codified data from the electronic medical record to confidently identify patients with PCOS. The most readily available identifier, the International Classification of Diseases, 9th edition code (ICD-9), misclassified 13–20 % of adolescents with PCOS [9]. Conversely, PCOS was confirmed as a diagnosis in only 73 % of adolescents in a separate study [10]. In adult women identified with PCOS using an ICD-9 code, 28 % had documented anovulation and clinical hyperandrogenism in the record, whereas an additional 52 % had only one of these features documented [11]. Additional validation is needed to determine whether the ICD-9 code is accurate in identifying adult women with PCOS.

Other codified data that could be useful to corroborate the diagnosis of PCOS include laboratory measurements and ultrasound results. However, an elevated androgen level is not necessary for a diagnosis of PCOS in the setting of clinical hyperandrogenism, and measured levels may be altered by treatment. Further, while laboratory tests could be used to exclude patients with other diagnoses, the results may not be electronically available. Current Procedural Terminology codes may be available to indicate that a pelvic ultrasound was performed. However, the necessary ultrasound parameters for the diagnosis of polycystic ovary morphology, such as ovarian volume, are not typically codified data and cannot be captured as confirmatory information for a PCOS diagnosis. Taken together, the specificity of the ICD-9 code when at least one confirmatory PCOS feature was available in the electronic medical record is approximately 70–80 %, but other confirmatory codified data may not be readily available. Therefore, more extensive analyses may be needed to identify women with PCOS in electronic medical records.

Natural language processing takes electronic free text and codifies the data into computationally functional categories [12]. These categories can be used to establish diagnostic features useful for selecting women with PCOS and confirming the presenting features. We identified a cohort of women with PCOS using ICD-9 codes and identified a second cohort using natural language processing along with codified data. The primary outcome of the study was a comparison of the positive predictive value of the PCOS diagnosis using the ICD-9 code compared to an algorithm used to identify PCOS that incorporated natural language processing and codified data. The secondary outcome was validation of the algorithm cohort using a previously identified, well-phenotyped cohort of subjects with PCOS. The data highlight the utility and limitations of using natural language processing to accurately identify large sets of women with PCOS.

Methods

Data source

The primary data source was the Partners Healthcare Research Patient Data Registry (RPDR), spanning more than 20 years of data from 4.2 million patients. The database contains over 227 million encounters, 193 million coded ICD-9 diagnoses, 105 million medications, 200 million procedures, 852 million lab values and over 55 million unstructured clinical notes, which are a combination of outpatient visit notes, inpatient discharge summaries, radiology reports, and others. The RPDR population is approximately 55 % female and 72 % Caucasian, with a mean age of 45.7 years (standard deviation 23.2 years).

We initially identified women with PCOS using the ICD-9 code 256.4 in the RPDR database (n = 13,670). Two hundred randomly identified charts were reviewed individually for confirmation of a PCOS diagnosis. Twelve records had no notes, labs or ultrasounds available, and were not included in the final count.

Subsequently, an initial, broadly defined dataset (referred to as the broad data ‘mart’; n = 265,481) was identified using the ICD-9 code for PCOS and other potentially relevant ICD-9 codes for inclusion and exclusion of polycystic ovary syndrome (Additional file 1: Figure S1 and Table S1). Women, aged 18 to 74 years, with more than one longitudinal medical record note greater than 50 characters were included in the search. Inclusion codes were PCOS (256.4), menstrual disorders (626.x), female infertility (628.x), hirsutism (704.1), alopecia (704.00), acne (706.x) and diabetes complicating pregnancy (648.0x) (Additional file 1: Table S1 and Figure S1). Ovarian procedures, including wedge resection (65.22 and 65.24), medications (topical acne agents, metformin and isotretinoin) and laboratory tests (high testosterone and DHEAS) were also included.
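The diagnosis-code part of the broad inclusion criteria can be sketched as a simple code matcher. This is a minimal illustration, not the query used in the study: the function names are hypothetical, a trailing "x" in a code is assumed to mean "any subcode", and the procedure, medication and laboratory criteria are left out.

```python
# Inclusion diagnosis codes transcribed from the text above.
EXACT_CODES = {
    "256.4": "PCOS",
    "704.1": "hirsutism",
    "704.00": "alopecia",
}
# Assumption: "626.x" etc. are read as "any code starting with this prefix".
CODE_FAMILIES = {
    "626.": "menstrual disorders",
    "628.": "female infertility",
    "706.": "acne",
    "648.0": "diabetes complicating pregnancy",
}


def matched_inclusions(icd9_codes):
    """Return the set of inclusion categories hit by a patient's ICD-9 codes."""
    hits = set()
    for code in icd9_codes:
        if code in EXACT_CODES:
            hits.add(EXACT_CODES[code])
        for prefix, label in CODE_FAMILIES.items():
            if code.startswith(prefix):
                hits.add(label)
    return hits


def in_broad_datamart(age, n_long_notes, icd9_codes):
    """Hypothetical filter: age 18 to 74, more than one note longer than
    50 characters, and at least one inclusion diagnosis code."""
    return 18 <= age <= 74 and n_long_notes > 1 and bool(matched_inclusions(icd9_codes))
```

Because so many common diagnoses (acne, menstrual disorders) qualify on their own, a filter of this shape is expected to be sensitive but not specific.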

A second refined datamart was created, which included women between the ages of 18 and 40 years with at least one mention of the term ‘PCOS’ in a clinical note at Massachusetts General Hospital (MGH) or Brigham and Women’s Hospital (BWH) (refined datamart; Additional file 1: Figure S1). Women with a history of fibroids (ICD-9 654.1*), ovarian cysts (ICD-9 620.2*), any eating disorder (307.1*, 307.5*), premature ovarian failure (ICD-9 256.3*), Cushing syndrome (ICD-9 255.0*), endometriosis (ICD-9 617.*) or a history of elevated prolactin (LOINC group: PRL), 17 hydroxy progesterone (LOINC group: 17OHPROG), urine free cortisol (LOINC group: U-F) or follicle-stimulating hormone (LOINC group: FSH) were excluded from the refined datamart (Additional file 1: Figure S1 and Table S2).
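The refined datamart combines a free-text mention with codified exclusions. The sketch below shows one way such a filter could be expressed; the function and argument names are hypothetical, the "*" in each exclusion code is assumed to be a wildcard over subcodes, the term search is assumed to be case-insensitive, and `elevated_lab_groups` is assumed to hold the LOINC group names for which the patient has an elevated result.

```python
import re

# Exclusion code prefixes and lab groups transcribed from the text above.
EXCLUSION_PREFIXES = ("654.1", "620.2", "307.1", "307.5", "256.3", "255.0", "617.")
EXCLUSION_LAB_GROUPS = {"PRL", "17OHPROG", "U-F", "FSH"}
PCOS_MENTION = re.compile(r"\bPCOS\b", re.IGNORECASE)


def in_refined_datamart(age, notes, icd9_codes, elevated_lab_groups):
    """Hypothetical filter mirroring the refined datamart criteria."""
    if not 18 <= age <= 40:
        return False
    # Require at least one mention of the term "PCOS" in a clinical note.
    if not any(PCOS_MENTION.search(note) for note in notes):
        return False
    # Drop patients carrying any exclusionary diagnosis code.
    if any(code.startswith(EXCLUSION_PREFIXES) for code in icd9_codes):
        return False
    # Drop patients with an elevated result in any exclusionary lab group.
    if EXCLUSION_LAB_GROUPS & set(elevated_lab_groups):
        return False
    return True
```

Note that a bare term search does not distinguish "PCOS" from "no evidence of PCOS"; handling negation is the job of the NLP step described later.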

The data marts consisted of all electronic records for study patients stored using the i2b2 software (i2b2 v1.6.04; USA) [13]. The i2b2 system is a scalable computational framework for managing human health data and the Workbench facilitates analysis and visualization of such data. The Partners Institutional Review Board approved all aspects of this study and the usual safeguards for human subjects’ data were applied.

Training Set

The full electronic medical record of 50 women sampled randomly from the initial broadly defined datamart and 199 women sampled randomly from the refined datamart population were reviewed by a board-certified clinician investigator (CKW). Patients were classified as definite PCOS, probable PCOS, definite NOT PCOS, or not enough information. A subset of 20 notes from the refined datamart was reviewed by an additional investigator to assess inter-rater reliability of the sample (CCK). For patients classified as true cases (definite or probable), related signs and symptoms, comorbidities and other phenotypes were abstracted from the medical record to inform feature selection and model training for NLP analysis (Table 1).

Table 1 Polycystic ovary syndrome related signs, symptoms, comorbidities, medication, laboratory results, ultrasound findings and other phenotypes abstracted from the medical record to inform feature selection and model training for natural language processing (NLP) analysis

NLP analysis

An expert-defined list of terms (custom dictionary) was created including clinically-relevant phenotypic features of PCOS (e.g. ‘alopecia’, ‘hirsutism’), terms related to comorbidities of PCOS (e.g. ‘obesity’, ‘infertility’) as well as terms related to potential competing diagnoses (e.g. ‘Cushing’s syndrome’, ‘eating disorder’, ‘hypothalamic amenorrhea’). The terms were then mapped to the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED-CT), a hierarchically organized clinical health care terminology index with over 300,000 concepts, to allow for variations in language use, or to RxNorm, a normalized naming system for generic and branded drugs.

Outpatient notes, discharge summaries, radiology reports, operative notes and pathology reports were then processed using the clinical Text Analysis and Knowledge Extraction System (cTAKES) [14], which processes clinical text notes and identifies a term mentioned in the text, along with qualifying attributes (i.e., negated, non-negated, current, history of, family history of). We computed the number of times each term was mentioned across all notes for each patient.
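The per-patient term counts can be illustrated with a small aggregation over extracted mentions. This sketch does not call cTAKES; it assumes the annotations have already been exported as hypothetical `(patient_id, term, negated)` tuples, and keeping negated and affirmed mentions in separate counters is an assumption rather than something the text specifies.

```python
from collections import Counter, defaultdict


def count_mentions(annotations):
    """Count term mentions per patient across all notes.

    annotations: iterable of (patient_id, term, negated) tuples, a
    hypothetical export format for the extracted mentions.
    Returns {patient_id: Counter({(term, polarity): count})}.
    """
    counts = defaultdict(Counter)
    for patient_id, term, negated in annotations:
        polarity = "negated" if negated else "affirmed"
        counts[patient_id][(term, polarity)] += 1
    return counts


example = [
    ("p1", "hirsutism", False),
    ("p1", "hirsutism", False),
    ("p1", "acne", True),
    ("p2", "infertility", False),
]
term_counts = count_mentions(example)
```

Each patient's counter can then be flattened into one row of a feature matrix for model training.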

Training a classification algorithm

A proportional odds kernel machine (POKM) regression procedure for ordinal outcomes prediction [15] was performed on the training set of 198 subjects with available data. The training set consisted of 46 features and the chart-reviewed gold standard PCOS label taking three ordinal levels: definite PCOS (PCOSD), probable PCOS (PCOSP) and no PCOS. The POKM with Gaussian kernel, incorporating non-linear effects of the predictors, improves the prediction performance of the final classification algorithm. The tuning parameters required in the modeling were selected based on the cross-validation and Akaike information criteria as discussed previously [15]. The algorithm was applied to the remaining subjects in the refined datamart and probabilities of having PCOSD and PCOSP were assigned to each subject. A subject is classified as PCOS positive if the predicted probability of having PCOSD, \( p_{\mathrm{PCOS}_D} \), exceeds a threshold value. The threshold value was chosen to ensure that among those classified as PCOS positive, 75 % have PCOSD.
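The threshold rule can be sketched independently of the POKM model itself. The function below is a hypothetical illustration: given predicted probabilities and chart-review labels for a labelled sample, it returns the lowest cutoff at which at least 75 % of subjects at or above the cutoff are definite PCOS. Treating subjects at exactly the cutoff as positive is a simplification of "exceeds".

```python
def choose_threshold(probs, is_definite, target_ppv=0.75):
    """Return the lowest cutoff whose PPV for definite PCOS meets the target.

    probs: predicted probabilities of definite PCOS.
    is_definite: matching 0/1 chart-review labels.
    Subjects with prob >= cutoff are called positive. Returns None if no
    cutoff reaches the target PPV.
    """
    pairs = sorted(zip(probs, is_definite), reverse=True)
    best = None
    true_positives = 0
    for i, (prob, label) in enumerate(pairs, start=1):
        true_positives += label
        # Evaluate only after the last subject sharing this probability.
        if i < len(pairs) and pairs[i][0] == prob:
            continue
        if true_positives / i >= target_ppv:
            best = prob
    return best


cutoff = choose_threshold([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 1, 0, 1, 0])
```

Choosing the lowest qualifying cutoff keeps the target positive predictive value while retaining as many subjects as possible.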

Controls

Patients with at least one visit to a women’s health clinic at MGH or BWH, no mention of the term PCOS in a clinical note and no history of clinically-relevant features of PCOS were selected as controls for the study (control pool). Patients selected by the classification algorithm were then matched 1:10 to women in the control pool on the basis of age, gender, number of recorded events (diagnoses, procedures, lab tests and medications) and earliest and most recent visit in the health system.
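The text does not state the matching procedure, so the following is only one plausible sketch: a greedy nearest-neighbour match without replacement. It uses just two of the listed variables (age and number of recorded events), and the function name and data layout are hypothetical.

```python
def match_controls(cases, controls, ratio=10):
    """Greedily assign up to `ratio` unused controls to each case.

    cases, controls: dicts mapping an id to (age, n_events).
    Controls are ranked by age difference first, then by difference in
    event count; each control is used at most once.
    """
    available = dict(controls)  # copy so the caller's pool is not modified
    matches = {}
    for case_id, (age, n_events) in cases.items():
        ranked = sorted(
            available,
            key=lambda cid: (
                abs(available[cid][0] - age),
                abs(available[cid][1] - n_events),
            ),
        )
        chosen = ranked[:ratio]
        matches[case_id] = chosen
        for cid in chosen:
            del available[cid]
    return matches


pool = {"k1": (30, 90), "k2": (31, 100), "k3": (45, 100)}
matched = match_controls({"c1": (30, 100)}, pool, ratio=2)
```

A greedy match depends on the order in which cases are processed; an optimal matching algorithm would avoid that at the cost of more computation.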

Validation

For PCOS subjects identified using ICD-9 codes and predicted definite PCOS and probable PCOS using the algorithm, 200 and 191 charts were reviewed, respectively. The number of chart-review validated PCOS subjects was determined to provide a true positive predictive value. A diagnosis of PCOS was confirmed if at least two of the following three features were present: 1) history or physical exam evidence of hirsutism, acne or alopecia, or an elevated total testosterone or DHEAS level [5], 2) irregular menses as documented in the history, and/or 3) polycystic ovary morphology on ultrasound reports consisting of a volume of at least 10 mL in an ovary without a dominant follicle or cyst and/or a description of a large number of follicles [1]. Presence of one confirmatory feature was considered probable PCOS. Presence of an exclusionary diagnosis, such as anorexia nervosa, hypothalamic amenorrhea or primary ovarian insufficiency was considered definitely not PCOS.
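The chart-review rule above reduces to counting confirmatory features. The sketch below encodes it with hypothetical argument names; checking the exclusionary diagnosis first, and labelling charts with no confirmatory feature as "not enough information", are assumptions based on the categories used elsewhere in the Methods.

```python
def classify_chart(hyperandrogenism, irregular_menses, pco_morphology,
                   exclusionary_diagnosis=False):
    """Classify a reviewed chart from its three confirmatory features.

    Each feature argument is True when the chart documents it.
    """
    if exclusionary_diagnosis:
        return "definitely not PCOS"
    n_features = sum([hyperandrogenism, irregular_menses, pco_morphology])
    if n_features >= 2:
        return "definite PCOS"
    if n_features == 1:
        return "probable PCOS"
    return "not enough information"
```

For example, a chart documenting hirsutism and irregular menses but no ultrasound would be classified as definite PCOS.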

In addition, a list of medical records from subjects recruited with PCOS for a previous study (n = 693) was submitted to determine whether they appeared in the PCOS subject dataset after the algorithm was applied [16]. These subjects had physical exam, laboratory and ultrasound data that confirmed a diagnosis of PCOS by the NIH criteria, as previously described [5].

Results

Using ICD-9 codes, 200 charts (total n = 13,670) were examined to identify confirmatory criteria for the diagnosis of PCOS (Table 2). A total of 132 subjects had 2 confirmatory findings that documented the diagnosis of PCOS, while 29 had one confirmatory finding. The positive predictive value was 74 % for definite PCOS and 90 % for definite and probable PCOS. Of those with two confirmatory findings, 84 % had PCOS documented using NIH criteria. Twenty-two subjects had no confirmatory information for the diagnosis of PCOS. There was a 9.5 % false positive rate, with exclusionary diagnoses including primary ovarian insufficiency (n = 4), endometriosis (n = 1), hyperprolactinemia (n = 2), premenstrual dysphoric disorder (n = 1), eating disorder (n = 3), hypothalamic amenorrhea (n = 1), opioid use (n = 1), pituitary tumor (n = 1), mistaken diagnosis (n = 2) or family history only (n = 1).
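The reported percentages appear consistent with a denominator that leaves out the 22 charts without confirmatory information. This is an inference from the counts rather than a stated method, so the arithmetic below should be read as a check under that assumption.

```python
# Counts reported for the 200 ICD-9 charts reviewed.
definite = 132          # two confirmatory findings
probable = 29           # one confirmatory finding
no_information = 22     # no confirmatory information
exclusionary = [4, 1, 2, 1, 3, 1, 1, 1, 2, 1]  # per-diagnosis counts listed above
not_pcos = sum(exclusionary)

# Assumption: charts lacking confirmatory information are excluded
# from the denominator.
classifiable = definite + probable + not_pcos

ppv_definite = definite / classifiable
ppv_definite_or_probable = (definite + probable) / classifiable
false_positive_rate = not_pcos / classifiable

print(f"{ppv_definite:.1%} {ppv_definite_or_probable:.1%} {false_positive_rate:.1%}")
```

Under this assumption the three values come to roughly 74 %, 90 % and 9.5 %, and the four categories sum to the 200 charts reviewed.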

Table 2 Comparison of true polycystic ovary syndrome (PCOS) on chart review in women with PCOS determined using ICD-9 codes or using an algorithm incorporating natural language processing and codified data

In contrast, an initial review of random notes from the broad datamart (Additional file 1: Table S1) identified only 1/17 (5.8 %) with a confirmed diagnosis of PCOS. Differences included a broad range of diagnostic codes used to widen the pool for subsequent algorithm development and no upper age limit. Therefore, age less than 45 at the time of diagnosis or presenting feature was added as an inclusion criterion and eating disorders were added to the exclusion criteria (n = 178,510). However, only 6/50 (12 %) subjects had definite or probable PCOS. The proportion of definitely positive PCOS subjects was too low to proceed with algorithm development, based on previous experience [17].

In the refined datamart, a total of 13,077 patients met the criteria for the study population after exclusions were applied (Additional file 1: Table S2). The refined datamart overlapped with the broad datamart (71 %), but included additional subjects not identified using codified data (Additional file 1: Figure S1). Of the 200 randomly-selected patients in the training set, 93 (46.5 %) were classified as definite PCOS and 59 (29.5 %) as probable PCOS. There were 17 subjects who did not have PCOS for an 8.5 % false positive rate. Thirty-one subjects (15.5 %) did not have available information to confirm the diagnosis of PCOS. The positive predictive value was 85 %, for definite and probable PCOS, similar to that using the ICD-9 codes (p = 0.7).

Algorithm results

The data from 198 subjects in the training set were evaluated with the cTAKES results. Data were collapsed into 36 NLP and 14 codified terms after removing terms that were not found in at least 10 % of subjects. Using these terms, the area under the curve of the algorithm for classifying PCOSD was 0.87. A cutoff of 0.392 was chosen to achieve a positive predictive value of 0.75 for PCOSD and to maximize the number of subjects identified. The positive predictive value for definite/probable PCOS was 91 % (95 % confidence interval: 84–96 %). This cut-off value classifies 6295 patients in the data mart (48.6 %) as definite PCOS.
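For readers unfamiliar with the area under the curve, it can be read as the probability that a randomly chosen definite PCOS subject receives a higher predicted probability than a randomly chosen subject without definite PCOS. The sketch below computes it by direct pairwise comparison; it is an illustration of the metric on made-up scores, not the evaluation code used in the study.

```python
def area_under_curve(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the ROC area under the curve.

    scores: predicted probabilities; labels: matching 0/1 outcomes.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


auc = area_under_curve([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

The pairwise form is quadratic in sample size, which is acceptable for a training set of a few hundred subjects.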

A subset of 150 charts from subjects with definite PCOS and 41 charts from subjects with probable PCOS were reviewed based on previous studies in diseases with a similar prevalence (Table 2) [17, 18]. When the definite and probable PCOS categories were combined, the review demonstrated a positive predictive value of 96 %. Further, stringent requirements for documentation of two Rotterdam criteria in the record resulted in a 68 % positive predictive value. The majority of these definite PCOS subjects (81 %) also met the NIH criteria for PCOS. The false positive rate was 10 %. The validated categories were not different using ICD-9 codes, extracting subjects using the term “PCOS” in the electronic medical record or using the algorithm (p = 0.2). However, the proportion of subjects for which the diagnosis of PCOS could not be determined was significantly lower using the algorithm than using ICD-9 codes (5 vs 11 %; p < 0.04).

Validation results

Of the 693 subjects with PCOS recruited through a previous study, 451 were present in the broad datamart and a subset of 201 was present in the refined datamart. The majority of subjects with PCOS recruited for the previous study who did not appear in the datamart lacked a sufficient number of notes: they were not patients in the Partners system (n = 178), were seen at a Partners hospital other than MGH or BWH (n = 12), or were employees (n = 26). The second most common reason for non-inclusion in the datamart was a documented elevated prolactin that was mildly elevated and subsequently normal (n = 17), was drawn during pregnancy or postpartum (n = 3) or was drawn after starting medication that raised prolactin after participation in the study (n = 3). Two subjects had an elevated urine free cortisol but were confirmed not to have Cushing syndrome. None of the PCOS subjects appeared in the control set.

Demographics of cases in the validated cohort and controls

Cases were slightly younger than controls (Table 3). The cases were also less likely to have had a pregnancy documented at one of the Partners hospitals. The difference may reflect the design of the study, because all controls were required to have presented for a women’s health visit. The lifetime maximum BMI was also greater in the cases than in the controls.

Table 3 Demographics of subjects chosen for the refined datamart

Discussion

Identifying subjects using ICD-9 codes, or using an algorithm that combines terms identified in electronic medical records with codified data from definite PCOS subjects, resulted in no significant difference in the positive predictive value for identification of PCOS subjects. However, the use of the algorithm resulted in fewer subjects with absent documentation confirming the PCOS diagnosis. Thus, the use of ICD-9 codes or an algorithm incorporating terms pertinent to the PCOS diagnosis results in a reasonable rate of identifying true cases with PCOS in the Partners Healthcare RPDR. Nevertheless, the use of the developed algorithm may improve confidence in large scale collections and data inquiry by removing indeterminate subjects in studies of PCOS.

There has been no systematic evaluation of the rate of true PCOS subjects identified using ICD-9 codes in adults. Our data suggest that the positive predictive value for PCOS is better than that from previous findings in adolescents with PCOS [9]. In contrast to a 13–20 % misclassification rate, only 8.5 % of subjects in the Partners Healthcare RPDR database were misclassified based on ICD-9 codes, although an additional 11 % did not have confirmatory data in the notes to make a clear PCOS status determination. These data suggest that ICD-9 codes may provide a reasonable proxy for true PCOS subjects if validated in other health record systems.

In contrast to the use of the ICD-9 code for PCOS, ICD-9 codes that identified features of PCOS such as irregular menses, hirsutism and acne were too broad to identify women with documented PCOS. Instead, using the term “polycystic ovary syndrome” in the electronic medical record in the refined datamart included subjects not defined by the ICD-9 code but with a greater specificity for PCOS, similar to ICD-9 coding alone. Taken together, the current electronic medical records database suggests that using broad catchment coding diagnoses would not be specific enough to capture subjects with true PCOS from an electronic medical records cohort.

Despite the moderately accurate performance of the ICD-9 code, greater confidence may be needed when using the collected datasets for analysis of fertility, associated medical problems and PCOS features or for collection of anonymized blood samples for study of genetics. The use of an algorithm containing codified data and parameters identified using the clinical Text Analysis and Knowledge Extraction System has the potential to greatly improve the confidence in patient identification. Previous studies demonstrate the superiority of cTAKES/codified data compared to ICD-9 codes for identifying subjects for large scale studies, with ROC characteristics increasing from 54 to 87 % in studies of depression [18]. Similarly, for inflammatory bowel disease, ROC characteristics improved from 86–89 % to 95–96 % [19]. Remarkably, there was a 38 % improvement in identification of rheumatoid arthritis patients using the cTAKES/codified data algorithms. We did not demonstrate such a remarkable improvement in PCOS patient identification using the same method. However, confirmation relied on physician documentation of the cardinal features of PCOS and these were not available in many charts. If the algorithm had been set at a higher cutoff for the positive predictive value, the proportion of definite PCOS subjects would have been greater, but at the expense of subject number.

The advantages of identifying PCOS subjects using cTAKES along with codified data are many. There can be very poor understanding of PCOS among physicians, and misclassification or missed diagnoses are common [20]. As an example in the current study, an ICD-9 code for PCOS was used during a work up that ultimately revealed an exclusionary diagnosis. On the other hand, there can be failure to understand that an elevated laboratory testosterone level is not necessary to make a diagnosis [21] and the ICD-9 code may not be used to indicate a diagnosis that is truly PCOS. These types of patients may be captured by words documented in electronic medical records during the work up. Previous studies also demonstrate that the specialty of the provider influences the criteria required to make a diagnosis of PCOS [8], resulting in missed diagnoses in some cases. The ability to analyze text may override some of these problems depending on the completeness of the notes utilized.

Indeed, the completeness and detail of the available electronic medical records are the most important factors limiting the algorithm method for identification of PCOS subjects. The algorithm relies on terminology, physical exam findings and an appropriate work up for PCOS. The use of templates that ensure proper documentation may not be flexible if they are not set up for the diagnosis of PCOS. As endorsed by an evidence-based methodology workshop on PCOS [2], providers are encouraged to use the Rotterdam criteria and to document the criteria through which the diagnosis of PCOS was made. If adopted, these measures will increase the power of the identification algorithm for large scale recruitment and data analysis. In addition, documenting menstrual cycle parameters [22] will also increase the ability to detect PCOS patients.

Conclusions

Within the Partners Healthcare RPDR, an algorithm developed using cTAKES and codified data compared with ICD-9 codes resulted in similar positive predictive values for identifying patients with PCOS. However, the algorithm improved confidence in PCOS case identification. The algorithm will be validated in an independent health care system to evaluate the performance with different health care providers and documentation. If validated, the algorithm may prove an invaluable tool for confident accrual of large numbers of women with PCOS.