Xtract: An overview

Computers and the Humanities

Abstract

Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations in large textual corpora. These techniques can identify collocations of arbitrary length as well as flexible collocations. They have been implemented in a lexicographic tool, Xtract, which acquires collocations automatically with high retrieval performance. Xtract works in three stages. The first stage uses a statistical technique to identify word pairs involved in a syntactic relation; the words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage uses a technique that extracts n-word collocations (or n-grams) in a much simpler way than related methods; these collocations can involve closed-class words such as particles and prepositions. The third stage is applied to the output of stage one: it parses the sentences involving a given word pair in order to identify the proper syntactic relation between the two words. As a side effect, the third stage filters out a number of candidate collocations as irrelevant and thus produces higher-quality output. In this paper we present an overview of Xtract and describe several uses for Xtract and the knowledge it retrieves, such as language generation and machine translation.
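The first stage described in the abstract can be illustrated with a minimal sketch. This is not Xtract's actual procedure: the abstract does not give the statistics involved, so the fixed window of five words, the raw count threshold, and the function names `cooccurrence_offsets` and `frequent_pairs` are all assumptions made for illustration. The sketch only shows the general idea of collecting word pairs that co-occur in either order, with other words in between, together with the distance between them.

```python
from collections import Counter, defaultdict


def cooccurrence_offsets(sentences, window=5):
    """Count how often each word pair co-occurs at each signed distance.

    Returns {(w1, w2): Counter({offset: count})}, where offset is the
    position of w2 minus the position of w1 and 1 <= |offset| <= window.
    The window size is an assumption, not a value taken from the paper.
    """
    table = defaultdict(Counter)
    for tokens in sentences:
        for i, w1 in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    table[(w1, tokens[j])][j - i] += 1
    return table


def frequent_pairs(table, min_count=2):
    """Keep pairs whose total co-occurrence count reaches min_count.

    A plain frequency cut-off stands in for a real statistical filter.
    """
    totals = {pair: sum(offsets.values()) for pair, offsets in table.items()}
    return {pair: n for pair, n in totals.items() if n >= min_count}


# Toy corpus, invented for this example.
corpus = [
    "the index rose sharply on monday".split(),
    "on tuesday the index rose again".split(),
    "prices rose while the composite index fell".split(),
]
table = cooccurrence_offsets(corpus)
pairs = frequent_pairs(table)
print(table[("index", "rose")])
print(pairs[("index", "rose")])
```

In this toy corpus the pair ("index", "rose") is found both with "rose" directly after "index" and with "rose" four words before it, which is the kind of order-independent, non-adjacent pairing the first stage targets. A realistic filter would replace the raw count threshold with a statistical test over the offset distribution.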



Author information


Frank Smadja is in the Department of Computer Science at Columbia University and has been working on lexical collocations for his doctoral thesis.


Cite this article

Smadja, F. Xtract: An overview. Comput Hum 26, 399–413 (1992). https://doi.org/10.1007/BF00136983
