An Automatic Unsupervised Querying Algorithm for Efficient Information Extraction in Biomedical Domain
In bioinformatics, extracting a relation such as protein-protein interactions from a large database of text documents is a challenging task. One major issue in biomedical information extraction is how to efficiently process the sheer size of the unstructured biomedical corpus. Often, only a small fraction of the documents in such a corpus contain information relevant to the extraction task. We propose a novel query expansion algorithm that automatically discovers the characteristics of documents that are useful for extracting a target relation. Our technique introduces a hybrid query re-weighting algorithm that combines a modified Robertson Sparck-Jones query ranking algorithm with a keyphrase extraction algorithm. It also adopts a novel query translation technique that incorporates part-of-speech (POS) categories into query translation. We conduct a series of experiments and report the results. They show that our technique retrieves more documents containing protein-protein pairs from MEDLINE as the number of iterations increases. We also compare our technique with SLIPPER, a supervised rule-based query expansion technique. Over four iterations, our technique outperforms SLIPPER by 17.90% to 29.98%.
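To illustrate the kind of term ranking that a Robertson Sparck-Jones weight provides, the following is a minimal sketch of the classic weight with 0.5 smoothing, not the modified version used in the paper, whose details the abstract does not give. The function names `rsj_weight` and `rank_terms`, the representation of documents as sets of tokens, and the toy inputs are all assumptions made for illustration; the keyphrase extraction and POS-based query translation steps are not shown.

```python
import math


def rsj_weight(r, R, n, N):
    """Classic Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: relevant documents containing the term
    R: relevant documents in total
    n: documents containing the term
    N: documents in total
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)


def rank_terms(relevant_docs, all_docs, top_k=5):
    """Rank candidate query terms drawn from the relevant documents.

    Each document is a set of tokens (an assumption of this sketch), and
    relevant_docs is expected to be a subset of all_docs.
    """
    R, N = len(relevant_docs), len(all_docs)
    vocabulary = set().union(*relevant_docs)
    scored = []
    for term in vocabulary:
        r = sum(term in doc for doc in relevant_docs)
        n = sum(term in doc for doc in all_docs)
        scored.append((term, rsj_weight(r, R, n, N)))
    # Highest weight first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

In an iterative setting such as the one described, the top-ranked terms from one round could be used to form the queries for the next round; how the paper combines these weights with keyphrase scores is not specified in the abstract.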
- 1. Agichtein, E., Gravano, L.: Querying Text Databases for Efficient Information Extraction. In: Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE), pp. 113–124 (2003)
- 3. Cohen, W.W., Singer, Y.: Simple, Fast, and Effective Rule Learner. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence and Eleventh Conference on Innovative Applications of Artificial Intelligence, July 18-22, pp. 335–342 (1999)
- 6. Song, M., Song, I.Y., Hu, T.: KPSpotter: A Flexible Information Gain-based Keyphrase Extraction System. In: Fifth International Workshop on Web Information and Data Management (WIDM 2003), pp. 50–53 (2003)