Phrasal Terms in Real-World IR Applications

  • Joe Zhou
Part of the Text, Speech and Language Technology book series (TLTB, volume 7)

Abstract

In this chapter we report our investigation of one important issue in real-world IR environments: the usefulness, extraction, and usage of phrasal terms. A large-scale empirical study has provided supporting evidence that phrasal terms can improve retrieval effectiveness, especially when their relative proximity information is understood from naturally running text. To automatically identify significant terms for a predefined topic, we have adopted a “gaining data from data” approach: the algorithm learns to select candidate terms through a meaningful comparison of a focused sample with a large and diverse base sample. To investigate whether the identified terms are useful for other IR applications, we applied these knowledge resources to document summarization and classification. The initial results look quite promising.
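
The abstract does not give the scoring function used to compare the focused sample with the base sample, so the following is only a minimal sketch of the general idea, not the chapter's algorithm. It assumes that candidate terms are adjacent word pairs and that a term is significant when its relative frequency in the focus sample is much higher than in the base sample. The function names `bigrams` and `rank_terms`, the `min_count` threshold, and the add-one smoothing are all hypothetical choices made for illustration.

```python
from collections import Counter
import re


def bigrams(text):
    """Lowercase the text and return adjacent word pairs as candidate terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def rank_terms(focus_docs, base_docs, min_count=2):
    """Rank bigrams by the ratio of focus-sample rate to base-sample rate.

    Assumed scoring (not taken from the chapter): relative frequency in the
    focus sample divided by add-one-smoothed relative frequency in the base
    sample. Bigrams seen fewer than min_count times in the focus sample are
    ignored.
    """
    focus = Counter(bg for doc in focus_docs for bg in bigrams(doc))
    base = Counter(bg for doc in base_docs for bg in bigrams(doc))
    focus_total = sum(focus.values()) or 1
    base_total = sum(base.values()) or 1

    scored = []
    for bg, count in focus.items():
        if count < min_count:
            continue
        focus_rate = count / focus_total
        # Add-one smoothing keeps the ratio finite for terms absent from the base sample.
        base_rate = (base[bg] + 1) / (base_total + 1)
        scored.append((" ".join(bg), focus_rate / base_rate))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    focus_docs = [
        "The jury found reasonable doubt.",
        "Reasonable doubt was raised by the defense.",
    ]
    base_docs = [
        "The weather was fine.",
        "The jury was out.",
        "The market was up by noon.",
    ]
    for term, score in rank_terms(focus_docs, base_docs):
        print(term, round(score, 2))
```

In this toy setup a topical phrase that recurs in the focus sample but is rare in the base sample rises to the top, while pairs that are common everywhere score low. A realistic system would need much larger samples and some linguistic filtering of candidates.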

Keywords

Domain Expert, Base Sample, Reasonable Doubt, Medical Malpractice, Focus Sample

Copyright information

© Springer Science+Business Media Dordrecht 1999

Authors and Affiliations

  • Joe Zhou (1)
  1. LEXIS-NEXIS, a Division of Reed Elsevier, Inc., Miamisburg, USA
