Natural Language Information Retrieval, pp. 215–259
Phrasal Terms in Real-World IR Applications
Abstract
In this chapter we report our investigation of one important issue in real-world IR environments: the usefulness, extraction, and usage of phrasal terms. A large-scale empirical study provided evidence that phrasal terms can improve retrieval effectiveness, especially when their relative proximity information is understood from the naturally running text. To automatically identify significant terms for a predefined topic, we adopted a “gaining data from data” approach: the algorithm learns to select candidate terms by comparing a focused sample with a large and diverse base sample. To investigate whether the identified terms are useful for other IR applications, we applied these knowledge resources to document summarization and classification. The initial results look promising.
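The comparison of a focused sample with a base sample can be illustrated with a small sketch. This is not the chapter's algorithm, which the abstract does not specify; it assumes candidate terms are word bigrams and ranks them by a smoothed log ratio of their relative frequency in the focus sample to that in the base sample. The function names, the `min_count` threshold, and the `smoothing` constant are all hypothetical choices made for illustration.

```python
from collections import Counter
import math


def bigrams(text):
    """Split text on whitespace and return adjacent lowercase word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def score_terms(focus_docs, base_docs, min_count=2, smoothing=1.0):
    """Rank bigrams by how much more frequent they are in the focus sample.

    Assumption (not from the chapter): the score is the log ratio of
    add-one-smoothed relative frequencies, focus sample over base sample.
    """
    focus = Counter(bg for doc in focus_docs for bg in bigrams(doc))
    base = Counter(bg for doc in base_docs for bg in bigrams(doc))
    focus_total = sum(focus.values())
    base_total = sum(base.values())
    vocab = len(set(focus) | set(base))

    scores = {}
    for term, count in focus.items():
        if count < min_count:
            continue  # skip bigrams too rare in the focus sample to trust
        p_focus = (count + smoothing) / (focus_total + smoothing * vocab)
        p_base = (base[term] + smoothing) / (base_total + smoothing * vocab)
        scores[" ".join(term)] = math.log(p_focus / p_base)

    # Highest score first: terms most characteristic of the focus sample.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A bigram that recurs in the focus sample but is rare or absent in the base sample receives a positive score. A realistic system would likely add linguistic filtering of candidates and a much larger base sample than this sketch assumes.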
Keywords
Domain Expert; Base Sample; Reasonable Doubt; Medical Malpractice; Focus Sample