Tree pattern expression for extracting information from syntactically parsed text corpora

Choi, Yong Suk

doi:10.1007/s10618-010-0184-8

Published: 14 July 2010

Volume 22, pages 211–231, (2011)

Data Mining and Knowledge Discovery

Abstract

With a growing number of syntactically parsed text corpora publicly available, it has become increasingly important to extract desired information from such corpora efficiently. Many conventional approaches extract a desired text part by matching the parse tree of each sentence against a query, represented as a structure of relational predicates that expresses a structural pattern common to the desired text parts. Although these approaches are useful for limited types of simple queries, they are inefficient in general: query formulation can become very complicated for complex patterns, and query matching is likely to take exponential time given the variety of complex sentence structures in a text corpus. To overcome these shortcomings, we present a novel tree pattern expression (TPE) that represents various structural patterns intuitively and significantly reduces pattern-matching complexity. This paper first proposes TPE and its pattern-matching algorithm and then analyzes the algorithm's complexity theoretically. It also describes a TPE-based information extraction system, which is applied to real text mining on a bio-text corpus. Finally, it reports experimental results and discusses them in comparison with other systems.
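
As a rough illustration of the relational-predicate style of query that the abstract attributes to conventional approaches, the following Python sketch matches a single "dominates" predicate against one parse tree. It is not the paper's TPE or its matching algorithm. It assumes Penn Treebank-style bracketed parses; the names `Node`, `parse_tree`, and `dominates_matches`, and the example sentence, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """A parse-tree node; a leaf node stores the word itself as its label."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the words covered by this node, left to right."""
        if not self.children:
            return [self.label]
        return [word for child in self.children for word in child.leaves()]

    def descendants(self) -> Iterator["Node"]:
        """Yield every node strictly below this one, in pre-order."""
        for child in self.children:
            yield child
            yield from child.descendants()


def parse_tree(text: str) -> Node:
    """Parse a bracketed tree such as "(S (NP (NN cat)) (VP (VBZ sleeps)))"."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Node:
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            node = Node(tokens[pos])      # the token after "(" is the label
            pos += 1
            while tokens[pos] != ")":
                node.children.append(read())
            pos += 1                      # skip ")"
            return node
        node = Node(tokens[pos])          # a bare token is a word (leaf)
        pos += 1
        return node

    return read()


def dominates_matches(root: Node, upper: str, lower: str) -> List[str]:
    """Return the text under each `upper` node that dominates a `lower` node.

    Hypothetical predicate for illustration: labels are compared by exact
    string equality, so a word spelled like a category label would also match.
    """
    hits = []
    for node in [root, *root.descendants()]:
        if node.label == upper and any(d.label == lower for d in node.descendants()):
            hits.append(" ".join(node.leaves()))
    return hits


# Illustrative sentence, not taken from the paper's corpus.
tree = parse_tree("(S (NP (NN IL-2)) (VP (VBZ activates) (NP (DT the) (NN gene))))")
print(dominates_matches(tree, "NP", "DT"))  # ['the gene']
```

Even in this toy version, each candidate node rescans its whole subtree, so one predicate already costs time quadratic in the tree size in the worst case. A query that combines many such predicates has to be checked against many combinations of candidate nodes, which gives a feel for why matching can become expensive on complex sentences.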

Abbreviations

TPE: Tree Pattern Expression
RE: Regular Expression

Author information

Authors and Affiliations

Division of Computer Science and Engineering, Hanyang University, Seongdong-gu, Seoul, 133-791, Korea
Yong Suk Choi

Corresponding author

Correspondence to Yong Suk Choi.

Additional information

Responsible editor: Chih-Jen Lin.

About this article

Cite this article

Choi, Y.S. Tree pattern expression for extracting information from syntactically parsed text corpora. Data Min Knowl Disc 22, 211–231 (2011). https://doi.org/10.1007/s10618-010-0184-8

Received: 22 September 2009
Accepted: 22 June 2010
Published: 14 July 2010
Issue Date: January 2011
DOI: https://doi.org/10.1007/s10618-010-0184-8
