Principles, Implementation Strategies, and Evaluation of a Corpus Query System

  • Ulrik Petersen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4002)


The last decade has seen an increase in the number of available corpus query systems. These systems generally implement a query language as well as a database model. We report on one such corpus query system, and evaluate its query language against a range of queries and criteria quoted from the literature. We show some important principles of the design of the query language, and argue for the strategy of separating what is retrieved by a linguistic query from the data retrieved in order to display or otherwise process the results, stating the needs for generality, simplicity, and modularity as reasons to prefer this strategy.


Implementation Strategy, Noun Phrase, Query Language, Language Resource, Prosodic Phrasing
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ulrik Petersen
    • Department of Communication and Psychology, University of Aalborg, Aalborg East, Denmark
