Hybrid index organizations for text databases

  • Christos Faloutsos
  • H. V. Jagadish
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 580)

Abstract

Because the frequency distribution of term occurrences is highly skewed (e.g., Zipf's law), it is unlikely that any single technique for indexing text can do well in all situations. In this paper we propose a hybrid approach to indexing text and show how it can outperform the traditional inverted B-tree index in storage overhead, in retrieval time, and, for dynamic databases, in insertion time, for both single-term and multiple-term queries. We demonstrate the benefits of our technique on a database of stories from the Associated Press news wire, and we provide formulae and guidelines on how to make optimal choices of the design parameters in real applications.
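To make the idea of a hybrid index concrete, the following is a minimal sketch under stated assumptions, not the authors' actual design, which the abstract does not spell out. It assumes one plausible split: terms that occur in many documents are stored as bit vectors, while rare terms keep ordinary postings lists. The names `build_hybrid_index`, `lookup`, `search_all`, and the `threshold` parameter are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_hybrid_index(docs, threshold):
    """Build a toy hybrid index over a list of token lists.

    Assumption (not taken from the paper): a term found in at least
    `threshold` documents is stored as a bit vector (a Python int whose
    bit i is set when document i contains the term); every other term
    keeps a sorted postings list of document ids.
    """
    postings = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        for term in set(tokens):
            postings[term].append(doc_id)  # ids arrive in ascending order

    frequent, rare = {}, {}
    for term, ids in postings.items():
        if len(ids) >= threshold:
            bits = 0
            for doc_id in ids:
                bits |= 1 << doc_id
            frequent[term] = bits
        else:
            rare[term] = ids
    return frequent, rare


def lookup(index, term, num_docs):
    """Return the sorted document ids that contain `term`."""
    frequent, rare = index
    if term in frequent:
        bits = frequent[term]
        return [i for i in range(num_docs) if (bits >> i) & 1]
    return list(rare.get(term, []))


def search_all(index, terms, num_docs):
    """Return ids of documents containing every term (conjunctive query)."""
    result = None
    for term in terms:
        ids = set(lookup(index, term, num_docs))
        result = ids if result is None else result & ids
    return sorted(result or [])
```

In this sketch a frequent term costs a fixed number of bits per document regardless of how often it occurs, while a rare term costs space proportional only to its few postings. That trade-off is the kind a skewed, Zipf-like distribution would reward, though the choice of `threshold` here is arbitrary rather than derived from the paper's formulae.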

Keywords

Search Time, Frequent Term, Inverted Index, Insertion Time, Disk Access
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1992

Authors and Affiliations

  • Christos Faloutsos (1)
  • H. V. Jagadish (2)
  1. University of Maryland, College Park
  2. AT&T Bell Laboratories, Murray Hill
