Xtract: An overview

Smadja, Frank

doi:10.1007/BF00136983

Published: December 1992

Computers and the Humanities, Volume 26, pages 399–413 (1992)

Frank Smadja¹

Abstract

Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations from large textual corpora. The techniques we developed are able to identify collocations of arbitrary length as well as flexible collocations. These techniques have been implemented in a lexicographic tool, Xtract, which is able to automatically acquire collocations with high retrieval performance. Xtract works in three stages. The first stage is based on a statistical technique for identifying word pairs involved in a syntactic relation. The words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage is based on a technique to extract n-word collocations (or n-grams) in a much simpler way than related methods. These collocations can involve closed class words such as particles and prepositions. A third stage is then applied to the output of stage one and applies parsing techniques to sentences involving a given word pair in order to identify the proper syntactic relation between the two words. A secondary effect of the third stage is to filter out a number of candidate collocations as irrelevant and thus produce higher quality output. In this paper we present an overview of Xtract and we describe several uses for Xtract and the knowledge it retrieves such as language generation and machine translation.
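The first stage described above looks for word pairs that co-occur in any order with other words between them. A minimal sketch of that idea follows. It is not the statistical technique used in Xtract, which the abstract does not spell out: it only counts unordered pairs inside a fixed look-ahead window. The function names, the `window` size of 5, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other.

    Hypothetical helper: a plain frequency count, not Xtract's actual statistic.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, left in enumerate(tokens):
            # Look ahead up to `window` positions, so intervening words are allowed.
            for right in tokens[i + 1 : i + 1 + window]:
                if left != right:
                    # Sort the pair so (a, b) and (b, a) share one counter key.
                    counts[tuple(sorted((left, right)))] += 1
    return counts


def frequent_pairs(counts, min_count=2):
    """Return the pairs seen at least `min_count` times, in alphabetical order."""
    return sorted(pair for pair, n in counts.items() if n >= min_count)


corpus = [
    "the stock market fell sharply",
    "the market for stock fell",
    "prices fell",
]
counts = cooccurrence_counts(corpus)
print(counts[("market", "stock")])  # → 2
```

Raw counts like these favor frequent function words such as "the", which is one reason a real system would need a statistical filter and, as in the third stage, syntactic analysis before accepting a pair as a collocation.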

Author information

Authors and Affiliations

Computer Science Department, Columbia University, 10027, New York, NY, USA
Frank Smadja

Additional information

Frank Smadja is in the Department of Computer Science at Columbia University and has been working on lexical collocations for his doctoral thesis.

About this article

Cite this article

Smadja, F. Xtract: An overview. Comput Hum 26, 399–413 (1992). https://doi.org/10.1007/BF00136983

Issue Date: December 1992
DOI: https://doi.org/10.1007/BF00136983
