Abstract
Collections of grammatically annotated texts (corpora), and in particular parsed corpora, present a challenge to current methods of analysis. Such corpora are large and highly structured heterogeneous data sources. In this paper we briefly describe the parsed one-million-word ICE-GB corpus and the ICECUP query system. We then consider the application of knowledge discovery in databases (KDD) to text corpora. Following Cupit and Shadbolt (Proceedings 9th European Knowledge Acquisition Workshop, EKAW '96; Berlin: Springer Verlag, pp. 245–261, 1996), we argue that effective linguistic knowledge discovery must rest on a process of redescription or, more precisely, abstraction, guided by the research question to be investigated. Abstraction maps relevant elements from the corpus to an abstract model of the research topic. This mapping may be implemented using a grammatical query representation such as ICECUP's Fuzzy Tree Fragments (FTFs). Since this abstractive process must be both experimental and expert-guided, ultimately a workbench is necessary to maintain, evaluate and refine the abstract model. We conclude with a pilot study, employing our approach, into aspects of noun phrase postmodifying clause structure. The data is analysed using the UNIT machine learning algorithm to search for significant interactions between domain variables. We show that our results are commensurable with those published in the linguistics literature, and discuss how the methodology may be improved.
References
Aarts, B., Nelson, G., and Wallis, S.A. 1998. Using fuzzy tree fragments to explore English grammar. English Today, 14:52–56.
Abeille (ed.) 1999. Journées ATALA sur les corpus annotés pour la syntaxe—Treebanks workshop, Paris: ATALA.
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Brill, E. 1992. A simple rule-based template tagger. In Proc. 3rd International Conference on Applied Natural Language Processing, Trento, Italy. Association for Computational Linguistics, New Jersey, pp. 152–155.
Briscoe, T. 1996. Robust Parsing. In Survey of The State of the Art in Human Language Technology (on-line document), R.A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zoe (Eds.). http://cslu.cse.ogi.edu/HLTsurvey, (+/ch3node9.html).
Burnage, G. and Dunlop, D. 1992. Encoding the British National Corpus. In English Language Corpora: Design, Analysis and Exploitation, J. Aarts, P. de Haan, and N. Oostdijk (Eds.). Amsterdam: Rodopi.
Clark, P. and Niblett, T. 1989. The CN2 Induction Algorithm. Machine Learning, 3:263–283.
Corbridge, C., Rugg, G., Major, N.P., Shadbolt, N.R., and Burton, A.M. 1994. Laddering: Technique and tool use in knowledge acquisition. Knowledge Acquisition, 6:315–341.
Cupit, J. and Shadbolt, N.R. 1996. Knowledge discovery in databases: Exploiting knowledge level redescription. In Advances in Knowledge Acquisition. Proc. 9th European Knowledge Acquisition Workshop, EKAW '96, N.R. Shadbolt, K. O'Hara, and G. Schreiber (Eds.). Berlin: Springer Verlag. pp. 245–261.
Fang, A.C. 1996. Automatically Generalising a Wide-Coverage Formal Grammar. In Synchronic Corpus Linguistics, C. Percy, C. Meyer, and I. Lancashire (Eds.). Amsterdam and Atlanta: Rodopi, pp. 131–146.
Fayyad, U. 1997. Editorial. Data Mining and Knowledge Discovery, 1:5–10.
Greenbaum, S. (Ed.) 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Haan, P. de. 1986. Exploring the Linguistic Database: Noun Phrase Complexity and Language Variation. In Corpus Linguistics and Beyond, W. Meijs (Ed.). Amsterdam: Rodopi, pp. 151–165.
Haberman, S.J. 1978. Analysis of Qualitative Data: Introductory Topics, Vol. 1, New York: Academic Press.
Huddleston, R. 1984. Introduction to the Grammar of English. Cambridge: Cambridge University Press.
Kondopoulou, P. 1997. Famine in Aigio: Analysing Memories of the Greek Famine. MA Dissertation, University of East London.
Lakatos, I. 1981. History of Science and Its Rational Reconstructions. In Scientific Revolutions, Oxford Readings in Philosophy, I. Hacking (Ed.) (1981). Oxford: OUP, pp. 107–127.
McEnery, A. and Wilson, A. 1996. Corpus Linguistics, Edinburgh: EUP.
Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, M., Ferguson, M., Katz, K., and Schasberger, B. 1994. The Penn Treebank: Annotating predicate argument structure. In Proc. Human Language Technology Workshop. San Francisco: Morgan Kaufmann.
Mooney, R.J. 1997. Inductive logic programming for natural language processing. In Proc. 6th International Workshop on Inductive Logic Programming, Berlin: Springer Verlag, pp. 3–21.
Mooney, R.J. and Califf, M.E. 1995. Induction of first order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3:1–24.
Muggleton, S. 1998. Inductive logic programming: Issues, results and the LLL challenge (abstract). In Proc. 13th European Conference on Artificial Intelligence, ECAI-98. Chichester: John Wiley. p. 697.
Michie, D. 1986. On Machine Intelligence (2nd ed.). Chichester: Ellis Horwood.
Nelson, G., Wallis, S.A., and Aarts, B. in press. Exploring Natural Language: The British Component of the International Corpus of English. Amsterdam: John Benjamins.
Niblett, T. and Bratko, I. 1987. Learning decision rules in noisy domains. In Bramer, M.A. (Ed.). Research and Development in Expert Systems, 3:25–34.
Quinlan, J.R. 1993. C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.
Quirk, R., Greenbaum, S., Leech, G., and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Shaw, M. and Gaines, B. 1987. An interactive knowledge elicitation technique using personal construct technology. In Knowledge Acquisition for Expert Systems: A Practical Handbook. A. Kidd (Ed.). New York: Plenum Press.
Wallis, S.A. 1993. Machine learning with knowledge. In Proc. MLnet Workshop on Scientific Discovery 1993, MLnet, pp. 123–131.
Wallis, S.A. and Nelson, G. 1997. Syntactic parsing as a knowledge acquisition problem. In E. Plaza and R. Benjamins (Eds.). Knowledge Acquisition, Modeling and Management. Proc. 10th European Knowledge Acquisition Workshop, EKAW '97, Berlin: Springer Verlag. pp. 285–300.
Wallis, S.A., Aarts, B., and Nelson, G. 1999. Parsing in reverse—Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP. In Corpora Galore, Papers from 19th Int. Conf. on English Language Research on Computerised Corpora, ICAME-98, J.M. Kirk (Ed.). Amsterdam: Rodopi, pp. 335–344.
Wallis, S.A. 1999a. Completing parsed corpora: From correction to evolution. In Abeille, 1999. pp. 7–12.
Wallis, S.A. 1999b. Machine Learning for Knowledge Discovery, PhD Thesis. University of Nottingham.
Wallis, S.A. and Nelson, G. 2000. Exploiting fuzzy tree fragments in the investigation of parsed corpora, Literary and Linguistic Computing, 15:339–361.
Wu, X. 1995. Knowledge Acquisition from Databases. Norwood, NJ: Ablex.
Zadeh, L. 1965. Fuzzy sets. Information and Control, 8:338–353.
Cite this article
Wallis, S., Nelson, G. Knowledge Discovery in Grammatically Analysed Corpora. Data Mining and Knowledge Discovery 5, 305–335 (2001). https://doi.org/10.1023/A:1011453128373