Combining classifiers for robust PICO element detection
- Cite this article as:
- Boudin, F., Nie, JY., Bartlett, J.C. et al. BMC Med Inform Decis Mak (2010) 10: 29. doi:10.1186/1472-6947-10-29
Formulating a clinical information need in terms of the four atomic parts which are Population/Problem, Intervention, Comparison and Outcome (known as PICO elements) facilitates searching for a precise answer within a large medical citation database. However, using PICO defined items in the information retrieval process requires a search engine to be able to detect and index PICO elements in the collection in order for the system to retrieve relevant documents.
In this study, we tested multiple supervised classification algorithms and their combinations for detecting PICO elements within medical abstracts. Using the structural descriptors that are embedded in some medical abstracts, we have automatically gathered large training/testing data sets for each PICO element.
Combining multiple classifiers using a weighted linear combination of their prediction scores achieves promising results with an f-measure score of 86.3% for P, 67% for I and 56.6% for O.
Our experiments on the identification of PICO elements showed that the task is very challenging. Nevertheless, the performance achieved by our identification method is competitive with previously published results and shows that this task can be achieved with a high accuracy for the P element but lower ones for I and O elements.
Helping physicians to formulate their clinical information needs through well-built, focused questions is one critical process of evidence-based practice (EBP) [1, 2]. Without a well-focused question, it is more difficult and time consuming to identify appropriate resources and search for an answer. Classical EBP teaching suggests that clinical questions can be separated in terms of four anatomic parts: Population/Problem (P), Intervention (I), Comparison (C) and Outcome (O), known as PICO elements. For example, the question "In children with an acute febrile illness, what is the efficacy of therapy with acetaminophen or ibuprofen in reducing fever?" can be formulated as:
Population/Problem: children/acute febrile illness
Intervention: acetaminophen
Comparison: ibuprofen
Outcome: reduction of fever
Formulating a well-focused question according to the PICO framework facilitates searching for a precise answer within a large medical database. However, using PICO terms in the information retrieval process is not straightforward. It requires the search engine to have detected and indexed PICO elements in the collection in order for the system to retrieve relevant documents. To our knowledge, no system has undertaken this level of indexing. In our pilot work we demonstrated that PICO elements are found in nearly all abstracts.
In terms of detecting PICO elements, it is not practical to annotate these elements at the phrase level due to significant unresolvable disagreement and inter-annotator reliability issues. This is why most previous work has focused on identifying PICO elements at the sentence level.
To date there is no satisfactory method of accurately predicting PICO elements from a corpus. In this study, we tested multiple supervised classification algorithms and their combinations for detecting PICO statements within medical abstracts. In continuation of previous work, we proposed to tackle the issue of extracting PICO elements in medical abstracts as a classification task and investigated the challenges of detecting these elements at the sentence-level.
Several previous approaches have reported promising results when categorizing sentence types in medical abstracts using classification tools [5–8]. Knight et al. showed that Machine Learning could be applied to label structural information of sentences (i.e. Introduction, Method, Results or Conclusion) using a combination of relative sentence position and word distribution.
Demner-Fushman and Lin have presented a method that uses either manually crafted pattern-matching rules or a combination of basic classifiers to detect PICO elements in medical abstracts. As a preliminary step, the MetaMap program is used to annotate biomedical concepts in abstracts, while relations between these concepts are extracted with SemRep, both tools being based on the Unified Medical Language System (UMLS). The method described by Demner-Fushman and Lin obtained interesting results, with an accuracy of 80% for predicting whether the phrase contains a description of the population and intervention, 86% for problem and between 68% and 95% for outcome. It has to be noted that these scores are difficult to put into context due to the modest size of the test corpus (143 abstracts for outcome and 100 abstracts for other elements). Most of the errors are related to complementary processing such as inaccurate sentence boundary identification, chunking, Part-Of-Speech (POS) tagging or word sense disambiguation in the Metathesaurus. Based on this observation, we decided not to rely on software leveraging semantic knowledge resources.
Recently, supervised classification was proposed by Hansen et al. to extract the number of trial participants. Results reported in their study show that the linear Support Vector Machine (SVM) algorithm achieves the best results with an f-measure of 86%. With only 75 highly topic-related abstracts used as the test set, this result may not be representative of a real-world task. Chung extended this work to I and O elements using Conditional Random Fields (CRF). To overcome data sparseness, PICO-structured abstracts were automatically gathered from Medline in order to train and test classifiers. Experiments on a manually annotated test set (318 abstracts) show that promising results were obtained (f-measure of 83% for I and 84% for O).
However, this study has several weaknesses. First, performance for each PICO element is computed in conjunction with the four generic rhetorical role classes (i.e. Aim, Method, Results and Conclusion). This methodology introduces bias by removing sentence candidates (sentences containing PICO elements are considered to occur only in the method section). Moreover, the rhetorical roles of previous, current and next sentences are included in the features used for classification, while in many medical documents these roles are not explicitly indicated, thus unavailable. Second, as POS tags are used as features, errors committed by the tagger will result in erroneous feature extraction. Finally, by using words as features and by including previous/following sentences feature sets, sentences are characterized by very high-dimensional feature vectors, which require high computational costs for their processing.
PICO elements are more often described implicitly in medical documents, and linguistic patterns can be used to detect them. However, the rule/pattern-based approach may require a large amount of manual work, and its robustness has yet to be proved on a large dataset. In this study, we tested a robust statistical classification approach, which requires less manual preparation.
Construction of training and test data
Using supervised machine learning techniques requires both training and testing data sets. This is a major issue, as the task of collecting data in a specialized domain has to be supervised by domain experts. This is also the reason why previous studies have been based on small sets of abstracts in tests. One solution is to use the structural information embedded in some abstracts for which the authors have clearly stated distinctive sentence headings. Some abstracts do contain explicit headings such as "PATIENTS", "SAMPLE" or "OUTCOMES", that can be used to locate sentences corresponding to PICO elements. Below is a segment of a document extracted from Medline using the PubMed interface (http://www.ncbi.nlm.nih.gov/pubmed; PMID: 19318702) in which PICO elements are clearly identified:
[...]PARTICIPANTS: 2426 nulliparous, non-diabetic women at term, with a singleton cephalic presenting fetus and in labour with a cervical dilatation of less than 6 cm. INTERVENTION: Consumption of a light diet or water during labour. MAIN OUTCOME MEASURES: The primary outcome measure was spontaneous vaginal delivery rate. Other outcomes measured included duration of labour [...]
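The heading-based collection of labelled sentences can be illustrated with a minimal Python sketch. The heading-to-element mapping, the heading pattern and the choice of keeping the first sentence under each heading are assumptions made for illustration (only "PATIENTS", "SAMPLE" and "OUTCOMES" are named in the text), and the function names are hypothetical.

```python
import re

# Assumed mapping from headings to PICO labels; only "PATIENTS", "SAMPLE"
# and "OUTCOMES" are quoted in the text, the others are illustrative.
HEADING_TO_PICO = {
    "PARTICIPANTS": "P", "PATIENTS": "P", "SAMPLE": "P",
    "INTERVENTION": "I", "INTERVENTIONS": "I",
    "OUTCOMES": "O", "MAIN OUTCOME MEASURES": "O",
}

# Assumption: a heading is a run of capital letters and spaces ending in a colon.
HEADING_RE = re.compile(r"([A-Z][A-Z ]+):\s*")


def first_sentence(text):
    """Return the text up to the first period followed by whitespace or the end."""
    match = re.search(r"(.+?\.)(\s|$)", text)
    return match.group(1) if match else text.strip()


def labelled_sentences(abstract):
    """Return (PICO label, first sentence under the heading) pairs."""
    # re.split with a capturing group gives
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = HEADING_RE.split(abstract)
    pairs = []
    for heading, body in zip(parts[1::2], parts[2::2]):
        label = HEADING_TO_PICO.get(heading.strip())
        if label and body.strip():
            pairs.append((label, first_sentence(body.strip())))
    return pairs
```

Applied to the segment above, this sketch would pair the PARTICIPANTS sentence with P, the INTERVENTION sentence with I and the first sentence under MAIN OUTCOME MEASURES with O.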
Statistics about the training data
Features used for classification
Prior to classification, each sentence underwent pre-processing that replaced words with their canonical forms. Numbers written as words (alphanumeric numbers) were converted to numeric form, and each word was checked against a series of manually crafted cue-word and cue-verb lists. Some examples of these cue-words and cue-verbs are shown below:
Cue-verbs: conduct (P), recruit (P), randomize (I), prescribe (I), assess (O), record (O)
Cue-words: population (P), group (P), placebo (I), treatment (I), mortality (O), outcome (O)
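As an illustration, counting cue-word and cue-verb matches per element could look like the following Python sketch. The lists contain only the examples quoted above (the complete lists are not given here), the tokens are assumed to be already reduced to canonical form, and `cue_counts` is a hypothetical helper name.

```python
# Only the examples quoted in the text; the complete lists are not available here.
CUE_VERBS = {
    "P": {"conduct", "recruit"},
    "I": {"randomize", "prescribe"},
    "O": {"assess", "record"},
}
CUE_WORDS = {
    "P": {"population", "group"},
    "I": {"placebo", "treatment"},
    "O": {"mortality", "outcome"},
}


def cue_counts(tokens, cue_lists):
    """Count, for each PICO element, how many tokens appear in its cue list.

    `tokens` is assumed to hold canonical (lemmatized, lower-cased) word forms.
    """
    return {
        element: sum(token in cues for token in tokens)
        for element, cues in cue_lists.items()
    }


tokens = ["patient", "be", "randomize", "to", "placebo", "or", "treatment"]
word_features = cue_counts(tokens, CUE_WORDS)   # counts per element for cue-words
verb_features = cue_counts(tokens, CUE_VERBS)   # counts per element for cue-verbs
```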
Statistics about the semantic type lists.
UMLS Semantic type identifiers
List 1 (Living Beings)
Age Group (T100), Family Group (T099), Group (T096), Human (T016), Patient or Disabled Group (T101), Population Group (T098)
List 2 (Disorders)
Acquired Abnormality (T020), Anatomical Abnormality (T190), Cell or Molecular Dysfunction (T049), Congenital Abnormality (T019), Disease or Syndrome (T047), Experimental Model of Disease (T050), Finding (T033), Injury or Poisoning (T037), Mental or Behavioral Dysfunction (T048), Neoplastic Process (T191), Pathologic Function (T046), Sign or Symptom (T184)
List 3 (Chemicals & Drugs)
Amino Acid, Peptide, or Protein (T116), Antibiotic (T195), Biologically Active Substance (T123), Biomedical or Dental Material (T122), Carbohydrate (T118), Chemical (T103), Chemical Viewed Functionally (T120), Chemical Viewed Structurally (T104), Clinical Drug (T200), Eicosanoid (T111), Element, Ion, or Isotope (T196), Enzyme (T126), Hazardous or Poisonous Substance (T131), Hormone (T125), Immunologic Factor (T129), Indicator, Reagent, or Diagnostic Aid (T130), Inorganic Chemical (T197), Lipid (T119), Neuroreactive Substance or Biogenic Amine (T124), Nucleic Acid, Nucleoside, or Nucleotide (T114), Organic Chemical (T109), Organophosphorus Compound (T115), Pharmacologic Substance (T121), Receptor (T192), Steroid (T110), Vitamin (T127)
Statistical features (marked with *) and knowledge-based (marked with †) features extracted for classifying sentences.
Position in the document (absolute, relative) *
Sentence length *
Number of punctuation marks *
Number of numeric numbers n > 10, n < 10 *
Word overlap with title *
Number of cue-words (P, I, O)†
Number of cue-verbs (P, I, O)†
MeSH semantic types
Number of (n = [0-9]+) †
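The statistical features listed above could be computed along the lines of the following Python sketch. The tokenization, the definition of relative position as index divided by the number of sentences, and the feature names are assumptions for illustration; the knowledge-based features that depend on external resources are left out.

```python
import re
import string


def statistical_features(sentences, index, title):
    """Build a small feature dictionary for sentences[index].

    Assumptions: tokens are lower-cased alphanumeric runs, and the relative
    position is the sentence index divided by the number of sentences.
    """
    sentence = sentences[index]
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    title_tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", sentence)]
    return {
        "abs_position": index,
        "rel_position": index / len(sentences),
        "length": len(tokens),
        "punctuation": sum(ch in string.punctuation for ch in sentence),
        # Numbers equal to 10 fall into neither bucket, as in the list above.
        "numbers_gt_10": sum(n > 10 for n in numbers),
        "numbers_lt_10": sum(n < 10 for n in numbers),
        "title_overlap": len(title_tokens & set(tokens)),
        # Occurrences of the pattern (n = [0-9]+), e.g. "n = 96".
        "n_equals": len(re.findall(r"\bn\s*=\s*[0-9]+", sentence)),
    }


sentences = ["We recruited 96 children (n = 96), aged 4 to 8.", "Outcomes were recorded."]
features = statistical_features(sentences, 0, "Scoliosis in children")
```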
PICO Identification process
Tagging each document was performed in a three-step process. First, the document was segmented into plain sentences. Then each sentence was converted into a feature vector using the previously described feature set. Finally, each vector was submitted to multiple classifiers, one for each element, allowing the system to label the corresponding sentence. We used several algorithms implemented in the Weka toolkit (http://www.cs.waikato.ac.nz/ml/): J48 and Random forest (decision trees), SVM (radial kernel of degree 3), multi-layer perceptron (MLP) and Naive Bayes (NB). For comparison, a position classifier (BL) was included as a baseline in our experiments. This baseline method was motivated by the observation that PICO statements are typically found in specific sections of the abstract, which are usually ordered as Population/Problem, Intervention/Comparison and Outcome. Therefore, the relative position of a sentence could also reasonably predict the PICO element to which it is related. Similar methods to define a baseline have been used in previous studies. Demner-Fushman et al. used the first or last three sentences of each abstract as the baseline. However, comparing classifiers that are restricted to labelling only one sentence per abstract with a multi-sentence baseline may lead to bias.
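The three-step process can be sketched in Python as follows. This is an illustration only and not the Weka-based setup used in the experiments: the sentence splitter is naive, and `toy_featurize` and `TOY_SCORERS` are hypothetical stand-ins that score a sentence by how close its relative position is to an assumed typical position for each element, in the spirit of the position baseline.

```python
import re


def split_sentences(abstract):
    """Step 1: naive segmentation on sentence-final punctuation plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]


def tag_abstract(abstract, featurize, scorers):
    """Steps 2 and 3: featurize every sentence, then let each per-element
    scorer pick its single highest-scoring sentence."""
    sentences = split_sentences(abstract)
    vectors = [featurize(sentences, i) for i in range(len(sentences))]
    tagged = {}
    for element, score in scorers.items():
        scores = [score(vector) for vector in vectors]
        best = max(range(len(sentences)), key=scores.__getitem__)
        tagged[element] = sentences[best]
    return tagged


def toy_featurize(sentences, index):
    """Hypothetical one-feature vector: relative position in [0, 1]."""
    return index / max(len(sentences) - 1, 1)


# Assumed typical positions (start, middle, end); these values are made up.
TOY_SCORERS = {
    "P": lambda rel: 1 - abs(rel - 0.0),
    "I": lambda rel: 1 - abs(rel - 0.5),
    "O": lambda rel: 1 - abs(rel - 1.0),
}

tagged = tag_abstract("A b. C d. E f.", toy_featurize, TOY_SCORERS)
```

In a real setting the scorers would be trained classifiers returning a prediction score for each feature vector; the structure of the loop stays the same.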
For each experiment, we report the precision, recall and f-measure of each PICO classifier. To paint a more realistic picture, 10-fold cross-validation was used for each classification algorithm. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (training set, 90% of the data), and validating the analysis on the other subset (testing set, 10% of the data). To reduce variability, 10 rounds of cross-validation were performed using different partitions, and the evaluation results were averaged over the rounds. Moreover, all sentence headings were removed from the data sets, converting all abstracts into unstructured ones. This treatment gave us a more realistic scenario by avoiding biased values for features relying on cue-word lists.
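The partitioning scheme can be made concrete with a short Python sketch. The shuffling seed and the function name `k_fold_splits` are assumptions; the sketch only shows how 10 complementary 90%/10% partitions can be produced, not how the classifiers were trained.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds.

    Every item appears in exactly one test fold; the remaining folds
    form the training set for that round.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_splits(100))
```

The evaluation scores of the 10 rounds would then be averaged to give the reported value.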
The output of our classifiers is judged to be correct if the predicted sentence corresponds to the labelled one.
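Under this correctness criterion, the three scores can be computed as in the Python sketch below. It assumes the balanced f-measure (harmonic mean of precision and recall) and represents predictions and reference labels as hypothetical dictionaries mapping an abstract identifier to a sentence index; an abstract with no prediction is simply absent from `predicted`.

```python
def precision_recall_f(predicted, gold):
    """Compute precision, recall and balanced f-measure.

    A prediction counts as correct when the predicted sentence index
    equals the labelled one for the same abstract.
    """
    correct = sum(1 for doc, index in predicted.items() if gold.get(doc) == index)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


predicted = {"a": 0, "b": 2, "c": 1}
gold = {"a": 0, "b": 1, "c": 1, "d": 3}
p, r, f = precision_recall_f(predicted, gold)
```

In this toy example two of the three predictions are correct and two of the four labelled sentences are found.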
Performance of each classifier in terms of precision (p), recall (r) and f-measure (f).
As different classification algorithms performed differently on different PICO elements, in the second series of experiments we used three strategies to combine the classifiers' predictions. The first method (F1) used voting: sentences that had been labelled by the majority of classifiers were considered candidates. In case of ambiguity (i.e. multiple sentences with the same number of votes), the average of the prediction scores was used to make a decision. The second and third methods computed a linear combination of the predicted values, either in an equi-probable scheme (F2) or using weights empirically fixed according to the observed f-measure ranking (F3) (i.e. for the P element: 5 for MLP, 4 for RF, 3 for J48, 2 for SVM and 1 for NB).
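The three combination strategies can be illustrated with the following Python sketch. It assumes that each classifier provides one prediction score per sentence and, for voting, that each classifier votes for its highest-scoring sentence; both are interpretations rather than details stated in the text. The scores are invented, and dividing by the sum of the weights is a convenience that does not change the ranking.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one in case of ties)."""
    return max(range(len(values)), key=values.__getitem__)


def vote_fusion(scores):
    """F1: majority vote; ties are broken by the average prediction score."""
    votes = Counter(argmax(s) for s in scores.values())
    top = max(votes.values())
    tied = [index for index, count in votes.items() if count == top]

    def mean_score(index):
        return sum(s[index] for s in scores.values()) / len(scores)

    return max(tied, key=mean_score)


def linear_fusion(scores, weights=None):
    """F2 (weights=None, all equal) or F3 (explicit weights)."""
    names = list(scores)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_sentences = len(scores[names[0]])
    combined = [
        sum(weights[name] * scores[name][i] for name in names) / total
        for i in range(n_sentences)
    ]
    return argmax(combined)


# Invented prediction scores for a three-sentence abstract.
scores = {
    "MLP": [0.9, 0.1, 0.0],
    "RF": [0.8, 0.2, 0.0],
    "J48": [0.4, 0.6, 0.0],
    "SVM": [0.0, 1.0, 0.0],
    "NB": [0.3, 0.7, 0.0],
}
# Weights given in the text for the P element.
p_weights = {"MLP": 5, "RF": 4, "J48": 3, "SVM": 2, "NB": 1}

f1_choice = vote_fusion(scores)
f2_choice = linear_fusion(scores)
f3_choice = linear_fusion(scores, p_weights)
```

With these invented scores, voting and the equal-weight combination select the second sentence, while the weighted combination follows the two highest-ranked classifiers and selects the first.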
Combining multiple classifiers using F3 achieved the best results with an f-measure score of 86.3% for P, 67% for I and 56.6% for O. This strategy always outperformed, in terms of f-measure, the best classifier alone.
Performance of the Outcome classifiers in terms of f-measure (f) at 2 and 3 sentence cut-off.
We tested a combination of classifiers to tackle the issue of sentence-level PICO element detection. Best results were obtained with a weighted linear combination of the prediction scores. Interestingly, the fusion strategies did not all consistently outperform the best classifier alone. This may be due to the high variation of performance between the classifiers.
Comparing our results with those published in previous studies is difficult, as the testing data sets are different and the scores are therefore not directly comparable. However, considering that we performed 10-fold cross-validation and that the size of our test data is considerably larger, our results suggest that this methodology tends to give more reliable results.
The O and I elements are more difficult to identify than P elements. The reason is not exclusively the decreasing amount of training data available but mainly the complexity of the task. Indeed, I elements are often misclassified because of the high number of possible candidates. For example, drugs not only have a generic or ingredient name but may also have several trade names. Terms belonging to the semantic groups usually assigned to I (e.g. drug names) are scattered throughout the abstract. Another reason is the use of non PICO-specific vocabulary, i.e. terms occurring in multiple PICO elements. For example, although treatments are highly related to intervention, they can also occur in other elements.
In the case of O elements, because abstracts generally contain more than one outcome, the data set we used for training is not really suited for the task. The fact that we used only one sentence per element while building our training data is a strong limiting factor. Sentence headings such as "OUTCOMES" clearly refer to several elements that are likely to be contained in more than one sentence. Previous work has shown that human annotators typically mark two or three sentences in each abstract as outcomes. Based on these observations, Demner-Fushman has proposed to evaluate the performance of the outcome classifier at a cut-off of two and three sentences. In the third series of experiments, we followed this assumption. Only SVM performance remained constant at the different sentence cut-offs. This is due to the fact that the classifier produces binary prediction values that do not permit labelling more than one sentence with a statistically significant difference over the others. Results confirm that the strategy consisting of a weighted combination of the prediction scores (F3) always performs better. Although this evaluation only roughly captures the performance of our classifiers, it shows that at a sentence cut-off of three, we are able to capture most of the outcomes.
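Evaluation at a sentence cut-off can be sketched in Python as follows. The sketch assumes that a sentence needs a strictly positive prediction score to be selected, which is one way to reproduce the behaviour described for binary outputs; the function names and scores are hypothetical.

```python
def top_k(scores, k):
    """Indices of up to k sentences with the highest positive scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if scores[i] > 0][:k]


def hit_at_cutoff(scores, gold_index, k):
    """True when the labelled sentence is among the top-k selected sentences."""
    return gold_index in top_k(scores, k)


graded = [0.10, 0.50, 0.30, 0.05]   # graded prediction scores
binary = [0, 1, 0, 0]               # binary output, as described for the SVM

hits_graded = [hit_at_cutoff(graded, 2, k) for k in (1, 2, 3)]
hits_binary = [hit_at_cutoff(binary, 2, k) for k in (1, 2, 3)]
```

With graded scores, raising the cut-off lets the labelled sentence enter the selection; with binary output only one sentence is ever selected, so the result does not change with the cut-off.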
Our experiments on the identification of PICO elements confirm that the task is very challenging. Using the structural descriptors that are embedded in some abstracts has allowed us to collect large data sets that would have been too costly and time-consuming to produce manually. Using a single sentence per element was very restrictive but pragmatic. Tagging all the sentences that are under a heading is not a good solution either, as the structural boundaries of an abstract can be vague. The question is, can we tolerate some noise in the training data? If so, one might expect the amount of training data to act as a form of smoothing, minimizing the impact of the false positive samples. But let us consider two examples (PMID: 18265550 and 18263693):
[...]PATIENTS:In total 686 limbs in 574 patients at various clinical ...
The clinical manifestations were categorized according to the CEAP ...
The distribution of venous insufficiency including the sapheno-femoral ...
The main duplex-derived parameters assessed were the reflux time ...
The venous reflux was assumed to be present if the duration of reflux was ...
The data obtained by APG were on VV (mL), VFI (mL/s), EF (%) and RVF ...
RESULTS: There was no significant difference in overall superficial venous [...]
[...]PATIENTS:96 children, median (interquartile range) age 4.8 year...
None received growth hormone treatment.
MAIN OUTCOME MEASURES: Two types of scoliosis were identified [...]
In the first abstract, it is clear that considering all the sentences below the PATIENTS descriptor as P statements introduces too many wrongly labelled samples. In the second example, the middle sentence about growth hormone treatment belongs to the P element; it contains useful secondary information, but potentially more important information is in the first sentence.
It has to be noted that the features we used did not rely on manually crafted patterns or Part-Of-Speech tagging. Errors introduced by pre-processing were therefore not propagated to higher levels.
In this study, we tested a robust statistical approach to PICO element detection within medical abstracts. The performance achieved by our identification method was competitive with previously published results in terms of overall precision and recall. The goal of this study was to understand whether sentence-level PICO detection was possible from a restricted set of features using Machine Learning techniques. Results showed that this task could be achieved with a high accuracy for the P element but not for I and O elements. The main issue remains in the evaluation. Having a sufficient number of manually annotated abstracts is now our priority. To this end, we are developing a web annotation tool that allows healthcare professionals to manually annotate Medline abstracts.
The work described in this paper was funded by the Social Sciences and Humanities Research Council (SSHRC). The authors would like to thank Dr. Ann McKibbon, Dr. Dina Demner-Fushman, Lorie Kloda, Laura Shea and Lucas Baire for their contribution in the project.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.