Two basic approaches to the development of special-topic document corpora are considered. Both techniques partition a general, heterogeneous parent corpus into distinct subsets of related documents. The first method partitions the corpus with respect to the topical content explicitly defined by the documents it contains. The technique relies on a logico-syntactic analysis of the document text to extract topic-denoting phrases, and on a weighting function based on the complexity of the logical relational environment of the extracted phrases. The second method is a profile-directed partitioning of the document corpus induced by an externally defined thesaurus of phrases. The topic coverage of the profile depends only on the specific requirements of the user community for whom it was defined. Any of a number of weighting functions can be applied to the phrases, and the weighting usually depends on the corpus itself. This technique is useful where text analysis is either impractical or impossible.
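The second, profile-directed method can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names `idf_weights` and `partition_by_profile`, the toy corpus and profile, the use of case-insensitive substring matching to detect thesaurus phrases, and a smoothed inverse-document-frequency weight standing in for "any one of a number of weighting functions" that depend on the corpus.

```python
import math


def idf_weights(corpus, phrases):
    """Hypothetical corpus-dependent weighting: smoothed inverse document frequency."""
    n = len(corpus)
    weights = {}
    for phrase in phrases:
        # Number of documents that contain the phrase (case-insensitive substring match).
        df = sum(1 for text in corpus.values() if phrase in text.lower())
        weights[phrase] = math.log((1 + n) / (1 + df)) + 1.0
    return weights


def partition_by_profile(corpus, profile):
    """Assign each document to the profile topic with the highest weighted phrase score.

    corpus:  dict mapping document id -> document text
    profile: non-empty dict mapping topic name -> list of lowercase thesaurus phrases
    Returns (subsets, unassigned); documents matching no phrase are left unassigned.
    """
    all_phrases = {p for phrases in profile.values() for p in phrases}
    weights = idf_weights(corpus, all_phrases)
    subsets = {topic: [] for topic in profile}
    unassigned = []
    for doc_id, text in corpus.items():
        lowered = text.lower()
        scores = {
            topic: sum(weights[p] for p in phrases if p in lowered)
            for topic, phrases in profile.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            subsets[best].append(doc_id)
        else:
            unassigned.append(doc_id)
    return subsets, unassigned


# Toy data, invented for this example.
corpus = {
    "d1": "Noun phrase extraction for document retrieval.",
    "d2": "A weighting algorithm for term weighting in retrieval systems.",
    "d3": "Weather report for Boston.",
}
profile = {
    "parsing": ["noun phrase"],
    "weighting": ["weighting algorithm", "term weighting"],
}
subsets, unassigned = partition_by_profile(corpus, profile)
print(subsets)
print(unassigned)
```

Because the profile is defined externally, the sketch needs no linguistic analysis of the documents, which matches the case the abstract describes where text analysis is impractical. Assigning each document to a single best-scoring topic keeps the resulting subsets distinct; a threshold or multi-topic assignment would be an equally plausible design.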
Keywords: Noun Phrase; Document Text; Retrieval Accuracy; Characteristic Term; Weighting Algorithm