Abstract
The last decade has seen an increase in the number of available corpus query systems. These systems generally implement a query language as well as a database model. We report on one such corpus query system, and evaluate its query language against a range of queries and criteria quoted from the literature. We show some important principles of the design of the query language, and argue for the strategy of separating what is retrieved by a linguistic query from the data retrieved in order to display or otherwise process the results, stating the needs for generality, simplicity, and modularity as reasons to prefer this strategy.
Keywords
- Implementation Strategy
- Noun Phrase
- Query Language
- Language Resource
- Prosodic Phrasing
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Petersen, U. (2006). Principles, Implementation Strategies, and Evaluation of a Corpus Query System. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds) Finite-State Methods and Natural Language Processing. FSMNLP 2005. Lecture Notes in Computer Science, vol 4002. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780885_21
DOI: https://doi.org/10.1007/11780885_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35467-3
Online ISBN: 978-3-540-35469-7
eBook Packages: Computer Science (R0)