Knowledge Discovery in Grammatically Analysed Corpora

Data Mining and Knowledge Discovery

Abstract

Collections of grammatically annotated texts (corpora), and in particular parsed corpora, present a challenge to current methods of analysis. Such corpora are large and highly structured heterogeneous data sources. In this paper we briefly describe the parsed one-million-word ICE-GB corpus and the ICECUP query system. We then consider the application of knowledge discovery in databases (KDD) to text corpora. Following Cupit and Shadbolt (Proceedings 9th European Knowledge Acquisition Workshop, EKAW '96; Berlin: Springer Verlag, pp. 245–261, 1996), we argue that effective linguistic knowledge discovery must be based on a process of redescription or, more precisely, abstraction, guided by the research question to be investigated. Abstraction maps relevant elements from the corpus to an abstract model of the research topic. This mapping may be implemented using a grammatical query representation such as ICECUP's Fuzzy Tree Fragments (FTFs). Since this abstractive process must be both experimental and expert-guided, ultimately a workbench is necessary to maintain, evaluate and refine the abstract model. We conclude with a pilot study, employing our approach, into aspects of noun phrase postmodifying clause structure. The data is analysed using the UNIT machine learning algorithm to search for significant interactions between domain variables. We show that our results are commensurable with those published in the linguistics literature, and discuss how the methodology may be improved.
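
The abstraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses neither ICECUP, FTFs nor the UNIT algorithm. The `Node` class, the node labels, the variable names (`text_type`, `finiteness`) and the toy data are all hypothetical, and a plain Pearson chi-square statistic stands in for the search for significant interactions between variables.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy parse-tree node (hypothetical, not the ICE-GB annotation scheme)."""
    function: str
    category: str
    features: frozenset = frozenset()
    children: list = field(default_factory=list)


def walk(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def abstract_cases(tree, text_type):
    """Map each postmodifying clause inside a noun phrase to one case (row)
    of the abstract model, with variables text_type and finiteness."""
    for node in walk(tree):
        if node.category != "NP":
            continue
        for child in node.children:
            if child.function == "postmodifier" and child.category == "clause":
                finite = "finite" if "finite" in child.features else "nonfinite"
                yield {"text_type": text_type, "finiteness": finite}


def chi_square(cases, x, y):
    """Pearson chi-square statistic for the x-by-y contingency table."""
    cases = list(cases)
    observed = Counter((c[x], c[y]) for c in cases)
    x_totals = Counter(c[x] for c in cases)
    y_totals = Counter(c[y] for c in cases)
    n = len(cases)
    return sum(
        (observed[(a, b)] - x_totals[a] * y_totals[b] / n) ** 2
        / (x_totals[a] * y_totals[b] / n)
        for a in x_totals
        for b in y_totals
    )


def toy_np(finite):
    """Build a toy noun phrase containing one postmodifying clause."""
    feats = frozenset({"finite"}) if finite else frozenset()
    clause = Node("postmodifier", "clause", feats)
    return Node("subject", "NP", children=[Node("head", "N"), clause])


# Invented toy data, not drawn from ICE-GB.
cases = []
for text_type, n_finite, n_nonfinite in [("spoken", 3, 1), ("written", 1, 3)]:
    for tree in [toy_np(True)] * n_finite + [toy_np(False)] * n_nonfinite:
        cases.extend(abstract_cases(tree, text_type))

print(len(cases), chi_square(cases, "text_type", "finiteness"))
```

The point of the sketch is the separation of concerns: the tree query decides what counts as a case, the mapping decides which variables describe it, and only then is a statistical or machine learning method applied to the resulting table.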

References

  • Aarts, B., Nelson, G., and Wallis, S.A. 1998. Using fuzzy tree fragments to explore English grammar. English Today, 14:52–56.

  • Abeillé, A. (Ed.) 1999. Journées ATALA sur les corpus annotés pour la syntaxe—Treebanks workshop, Paris: ATALA.

  • Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

  • Brill, E. 1992. A simple rule-based part of speech tagger. In Proc. 3rd Conference on Applied Natural Language Processing, Trento, Italy. Association for Computational Linguistics, New Jersey, pp. 152–155.

  • Briscoe, T. 1996. Robust Parsing. In Survey of The State of the Art in Human Language Technology (on-line document), R.A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue (Eds.). http://cslu.cse.ogi.edu/HLTsurvey, (+/ch3node9.html).

  • Burnage, G. and Dunlop, D. 1992. Encoding the British National Corpus. In English Language Corpora: Design, Analysis and Exploitation, J. Aarts, P. de Haan, and N. Oostdijk (Eds.). Amsterdam: Rodopi.

  • Clark, P. and Niblett, T. 1989. The CN2 Induction Algorithm. Machine Learning, 3:263–283.

  • Corbridge, C., Rugg, G., Major, N.P., Shadbolt, N.R., and Burton, A.M. 1994. Laddering: Technique and tool use in knowledge acquisition. Knowledge Acquisition, 6:315–341.

  • Cupit, J. and Shadbolt, N.R. 1996. Knowledge discovery in databases: Exploiting knowledge level redescription. In Advances in Knowledge Acquisition. Proc. 9th European Knowledge Acquisition Workshop, EKAW '96, N.R. Shadbolt, K. O'Hara, and G. Schreiber (Eds.). Berlin: Springer Verlag, pp. 245–261.

  • Fang, A.C. 1996. Automatically Generalising a Wide-Coverage Formal Grammar. In Synchronic Corpus Linguistics, C. Percy, C. Meyer, and I. Lancashire (Eds.). Amsterdam and Atlanta: Rodopi, pp. 131–146.

  • Fayyad, U. 1997. Editorial. Data Mining and Knowledge Discovery, 1:5–10.

  • Greenbaum, S. (Ed.) 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

  • Haan, P. de. 1986. Exploring the Linguistic Database: Noun Phrase Complexity and Language Variation. In Corpus Linguistics and Beyond, W. Meijs (Ed.). Amsterdam: Rodopi, pp. 151–165.

  • Haberman, S.J. 1978. Analysis of Qualitative Data: Introductory Topics, Vol. 1, New York: Academic Press.

  • Huddleston, R. 1984. Introduction to the Grammar of English. Cambridge: Cambridge University Press.

  • Kondopoulou, P. 1997. Famine in Aigio: Analysing Memories of the Greek Famine. MA Dissertation, University of East London.

  • Lakatos, I. 1981. History of Science and Its Rational Reconstructions. In Scientific Revolutions, Oxford Readings in Philosophy, I. Hacking (Ed.). Oxford: OUP, pp. 107–127.

  • McEnery, A. and Wilson, A. 1996. Corpus Linguistics, Edinburgh: EUP.

  • Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. 1994. The Penn Treebank: Annotating predicate argument structure. In Proc. Human Language Technology Workshop. San Francisco: Morgan Kaufmann.

  • Mooney, R.J. 1997. Inductive logic programming for natural language processing. In Proc. 6th International Workshop on Inductive Logic Programming, Berlin: Springer Verlag, pp. 3–21.

  • Mooney, R.J. and Califf, M.E. 1995. Induction of first order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3:1–24.

  • Muggleton, S. 1998. Inductive logic programming: Issues, results and the LLL challenge (abstract). In Proc. 13th European Conference on Artificial Intelligence, ECAI-98. Chichester: John Wiley. p. 697.

  • Michie, D. 1986. On Machine Intelligence (2nd ed.). Chichester: Ellis Horwood.

  • Nelson, G., Wallis, S.A., and Aarts, B. in press. Exploring Natural Language: The British Component of the International Corpus of English. Amsterdam: John Benjamins.

  • Niblett, T. and Bratko, I. 1987. Learning decision rules in noisy domains. In Bramer, M.A. (Ed.). Research and Development in Expert Systems, 3:25–34.

  • Quinlan, J. R. 1993. C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.

  • Quirk, R., Greenbaum, S., Leech, G., and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London: Longman.

  • Shaw, M. and Gaines, B. 1987. An interactive knowledge elicitation technique using personal construct technology. In Knowledge Acquisition for Expert Systems: A Practical Handbook. A. Kidd (Ed.). New York: Plenum Press.

  • Wallis, S.A. 1993. Machine learning with knowledge. In Proc. MLnet Workshop on Scientific Discovery 1993, MLnet, pp. 123–131.

  • Wallis, S.A. and Nelson, G. 1997. Syntactic parsing as a knowledge acquisition problem. In E. Plaza and R. Benjamins (Eds.). Knowledge Acquisition, Modeling and Management. Proc. 10th European Knowledge Acquisition Workshop, EKAW '97, Berlin: Springer Verlag, pp. 285–300.

  • Wallis, S.A., Aarts, B., and Nelson, G. 1999. Parsing in reverse—Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP. In Corpora Galore, Papers from 19th Int. Conf. on English Language Research on Computerised Corpora, ICAME-98, J.M. Kirk (Ed.). Amsterdam: Rodopi, pp. 335–344.

  • Wallis, S.A. 1999a. Completing parsed corpora: From correction to evolution. In Abeillé (Ed.) 1999, pp. 7–12.

  • Wallis, S.A. 1999b. Machine Learning for Knowledge Discovery, PhD Thesis. University of Nottingham.

  • Wallis, S.A. and Nelson, G. 2000. Exploiting fuzzy tree fragments in the investigation of parsed corpora, Literary and Linguistic Computing, 15:339–361.

  • Wu, X. 1995. Knowledge Acquisition from Databases. Norwood, NJ: Ablex.

  • Zadeh, L. 1965. Fuzzy sets. Information and Control, 8:338–353.

Cite this article

Wallis, S., Nelson, G. Knowledge Discovery in Grammatically Analysed Corpora. Data Mining and Knowledge Discovery 5, 305–335 (2001). https://doi.org/10.1023/A:1011453128373

  • DOI: https://doi.org/10.1023/A:1011453128373
