Knowledge Discovery in Grammatically Analysed Corpora

Wallis, Sean; Nelson, Gerald

doi:10.1023/A:1011453128373

Published: October 2001

Volume 5, pages 305–335, (2001)

Data Mining and Knowledge Discovery

Sean Wallis¹ &
Gerald Nelson²

Abstract

Collections of grammatically annotated texts (corpora), and in particular, parsed corpora, present a challenge to current methods of analysis. Such corpora are large and highly structured heterogeneous data sources. In this paper we briefly describe the parsed one-million word ICE-GB corpus, and the ICECUP query system. We then consider the application of knowledge discovery in databases (KDD) to text corpora. Following Cupit and Shadbolt (Proceedings 9th European Knowledge Acquisition Workshop, EKAW '96; Berlin: Springer Verlag, pp. 245–261, 1996), we argue that effective linguistic knowledge discovery must be based on a process of redescription or, more precisely, abstraction, based on the research question to be investigated. Abstraction maps relevant elements from the corpus to an abstract model of the research topic. This mapping may be implemented using a grammatical query representation such as ICECUP's Fuzzy Tree Fragments (FTFs). Since this abstractive process must be both experimental and expert-guided, ultimately a workbench is necessary to maintain, evaluate and refine the abstract model. We conclude with a pilot study, employing our approach, into aspects of noun phrase postmodifying clause structure. The data is analysed using the UNIT machine learning algorithm to search for significant interactions between domain variables. We show that our results are commensurable with those published in the linguistics literature, and discuss how the methodology may be improved.
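The final analysis step described in the abstract, searching for significant interactions between domain variables, can be illustrated in a generic way. The sketch below is not the UNIT algorithm and does not use ICECUP or ICE-GB data: it assumes the abstraction step has already reduced each corpus match to a pair of categorical values, and it applies an ordinary Pearson chi-square test of independence to them. The variable names (`text_mode`, `clause_type`) and all counts are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical abstracted cases: one (text_mode, clause_type) pair per match.
# These counts are invented for illustration, not taken from any corpus.
cases = (
    [("spoken", "finite")] * 30
    + [("spoken", "nonfinite")] * 10
    + [("written", "finite")] * 20
    + [("written", "nonfinite")] * 40
)


def chi_square(pairs):
    """Pearson chi-square statistic for independence of two categorical variables."""
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    total = len(pairs)
    stat = 0.0
    for a, b in product(row_totals, col_totals):
        # Expected cell count under the null hypothesis of independence.
        expected = row_totals[a] * col_totals[b] / total
        stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_1DF_05 = 3.841

stat = chi_square(cases)
significant = stat > CRITICAL_1DF_05
print(f"chi-square = {stat:.2f}, significant at 0.05: {significant}")
```

With a 2 × 2 table there is one degree of freedom, so the statistic is compared against the single critical value above. A table with more categories would need the critical value for its own degrees of freedom.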

References

Aarts, B., Nelson, G., and Wallis, S.A. 1998. Using fuzzy tree fragments to explore English grammar. English Today, 14:52–56.
Abeille (Ed.) 1999. Journées ATALA sur les corpus annotés pour la syntaxe—Treebanks workshop, Paris: ATALA.
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Brill, E. 1992. A simple rule-based template tagger. In Proc. 3rd International Conference on Applied Natural Language Processing, Trento, Italy. Association for Computational Linguistics, New Jersey, pp. 152–155.
Briscoe, T. 1996. Robust Parsing. In Survey of The State of the Art in Human Language Technology (on-line document), R.A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zoe (Eds.). http://cslu.cse.ogi.edu/HLTsurvey, (+/ch3node9.html).
Burnage, G. and Dunlop, D. 1992. Encoding the British National Corpus. In English Language Corpora: Design, Analysis and Exploitation, J. Aarts, P. de Haan, and N. Oostdijk (Eds.). Amsterdam: Rodopi.
Clark, P. and Niblett, T. 1989. The CN2 Induction Algorithm. Machine Learning, 3:263–283.
Corbridge, C., Rugg, G., Major, N.P., Shadbolt, N.R., and Burton, A.M. 1994. Laddering: Technique and tool use in knowledge acquisition. Knowledge Acquisition, 6:315–341.
Cupit, J. and Shadbolt, N.R. 1996. Knowledge discovery in databases: Exploiting knowledge level redescription. In Advances in Knowledge Acquisition. Proc. 9th European Knowledge Acquisition Workshop, EKAW '96, N.R. Shadbolt, K. O'Hara, and G. Schreiber (Eds.). Berlin: Springer Verlag. pp. 245–261.
Fang, A.C. 1996. Automatically Generalising a Wide-Coverage Formal Grammar. In Synchronic Corpus Linguistics, C. Percy, C. Meyer, and I. Lancashire (Eds.). Amsterdam and Atlanta: Rodopi, pp. 131–146.
Fayyad, U. 1997. Editorial. Data Mining and Knowledge Discovery, 1:5–10.
Greenbaum, S. (Ed.) 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.
Haan, P. de. 1986. Exploring the Linguistic Database: Noun Phrase Complexity and Language Variation. In Corpus Linguistics and Beyond, W. Meijs (Ed.). Amsterdam: Rodopi, pp. 151–165.
Haberman, S.J. 1978. Analysis of Qualitative Data: Introductory Topics, Vol. 1, New York: Academic Press.
Huddleston, R. 1984. Introduction to the Grammar of English. Cambridge: Cambridge University Press.
Kondopoulou, P. 1997. Famine in Aigio: Analysing Memories of the Greek Famine. MA Dissertation, University of East London.
Lakatos, I. 1981. History of Science and Its Rational Reconstructions. In Scientific Revolutions, Oxford Readings in Philosophy, I. Hacking (Ed.). Oxford: OUP, pp. 107–127.
McEnery, A. and Wilson, A. 1996. Corpus Linguistics, Edinburgh: EUP.
Marcus, M., Kim, G., Marcinkiewicz, M.A., MacIntyre, R., Bies, M., Ferguson, M., Katz, K., and Schasberger, B. 1994. The Penn Treebank: Annotating predicate argument structure. In Proc. Human Language Technology Workshop. San Francisco: Morgan Kaufmann.
Mooney, R.J. 1997. Inductive logic programming for natural language processing. In Proc. 6th International Workshop on Inductive Logic Programming, Berlin: Springer Verlag, pp. 3–21.
Mooney, R.J. and Califf, M.E. 1995. Induction of first order decision lists: Results on learning the past tense of English verbs. Journal of Artificial Intelligence Research, 3:1–24.
Muggleton, S. 1998. Inductive logic programming: Issues, results and the LLL challenge (abstract). In Proc. 13th European Conference on Artificial Intelligence, ECAI-98. Chichester: John Wiley. p. 697.
Michie, D. 1986. On Machine Intelligence (2nd ed.). Chichester: Ellis Horwood.
Nelson, G., Wallis, S.A., and Aarts, B. in press. Exploring Natural Language: The British Component of the International Corpus of English. Amsterdam: John Benjamins.
Niblett, T. and Bratko, I. 1987. Learning decision rules in noisy domains. In Bramer, M.A. (Ed.). Research and Development in Expert Systems, 3:25–34.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann.
Quirk, R., Greenbaum, S., Leech, G., and Svartvik, J. 1985. A Comprehensive Grammar of the English Language. London: Longman.
Shaw, M. and Gaines, B. 1987. An interactive knowledge elicitation technique using personal construct technology. In Knowledge Acquisition for Expert Systems: A Practical Handbook. A. Kidd (Ed.). New York: Plenum Press.
Wallis, S.A. 1993. Machine learning with knowledge. In Proc. MLnet Workshop on Scientific Discovery 1993, MLnet, pp. 123–131.
Wallis, S.A. and Nelson, G. 1997. Syntactic parsing as a knowledge acquisition problem. In E. Plaza and R. Benjamins (Eds.). Knowledge Acquisition, Modeling and Management. Proc. 10th European Knowledge Acquisition Workshop, EKAW '97, Berlin: Springer Verlag. pp. 285–300.
Wallis, S.A., Aarts, B., and Nelson, G. 1999. Parsing in reverse—Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP. In Corpora Galore, Papers from 19th Int. Conf. on English Language Research on Computerised Corpora, ICAME-98, J.M. Kirk (Ed.). Amsterdam: Rodopi, pp. 335–344.
Wallis, S.A. 1999a. Completing parsed corpora: From correction to evolution. In Abeille, 1999. pp. 7–12.
Wallis, S.A. 1999b. Machine Learning for Knowledge Discovery, PhD Thesis. University of Nottingham.
Wallis, S.A. and Nelson, G. 2000. Exploiting fuzzy tree fragments in the investigation of parsed corpora, Literary and Linguistic Computing, 15:339–361.
Wu, X. 1995. Knowledge Acquisition from Databases. Norwood, NJ: Ablex.
Zadeh, L. 1965. Fuzzy sets. Information and Control, 8:338–353.

Author information

Authors and Affiliations

Department of English, University of Hong Kong, Hong Kong
Sean Wallis
Survey of English Usage, University College, London, UK
Gerald Nelson

About this article

Cite this article

Wallis, S., Nelson, G. Knowledge Discovery in Grammatically Analysed Corpora. Data Mining and Knowledge Discovery 5, 305–335 (2001). https://doi.org/10.1023/A:1011453128373

Issue Date: October 2001
DOI: https://doi.org/10.1023/A:1011453128373
