The reference string indexing method

  • H.-J. Schek
Implementation And Simulation Techniques
Part of the Lecture Notes in Computer Science book series (LNCS, volume 65)

Abstract

The reference string indexing method is motivated by the aim of retrieving any piece of information by specifying arbitrary parts of it. Common restrictions, such as allowing only a fixed set of descriptors or (complete) keywords in document retrieval systems, or allowing queries only on certain (inverted) attribute values in formatted files, should be removed without losing the performance necessary for interactive use.

The solution to be described rests on the realistic assumption that the frequency distribution of character strings of a certain length, of words, or of word sequences in textual files, and likewise of attribute values or value combinations in formatted files, is not uniform but highly hyperbolic, or "Zipfian". The same holds for the usage of data, as expressed by the "80-20" law. Exploiting this assumption, a (small) set of "reference strings" is generated by a statistical analysis of collected queries or, if none are available, by estimating usage from the original data. Inverting these reference strings with respect to records or record clusters gives the reference string index.
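The generation step can be pictured with a small sketch. It is not the procedure from the paper, whose selection criteria the abstract does not give. It assumes that candidate reference strings are all substrings of 2 to 4 characters taken from a query log, that the k most frequent ones are kept, and that the index maps each reference string to the set of record numbers containing it. The names choose_reference_strings and build_index are hypothetical.

```python
from collections import Counter


def substrings(text, min_len, max_len):
    """Yield every substring of text whose length lies in [min_len, max_len]."""
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def choose_reference_strings(queries, k, min_len=2, max_len=4):
    """Pick the k most frequent substrings of the collected queries.

    Assumption: plain frequency ranking stands in for the statistical
    analysis mentioned in the abstract. Ties are broken alphabetically
    so that the result is deterministic.
    """
    counts = Counter()
    for query in queries:
        counts.update(substrings(query.lower(), min_len, max_len))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [string for string, _ in ranked[:k]]


def build_index(records, reference_strings):
    """Invert the reference strings: map each one to the records containing it."""
    index = {string: set() for string in reference_strings}
    for record_id, text in enumerate(records):
        lowered = text.lower()
        for string in reference_strings:
            if string in lowered:
                index[string].add(record_id)
    return index


if __name__ == "__main__":
    records = ["data base", "database system", "string index"]
    query_log = ["data", "data", "base"]
    refs = choose_reference_strings(query_log, k=3)
    print(refs)
    print(build_index(records, refs))
```

Because the query log is skewed, a handful of strings accounts for most of the counts, which is why a small k can still cover most routine queries under the Zipfian assumption.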

Depending on its estimated usage frequency, a search argument is either available in full as a reference string or has to be decomposed into shorter reference strings. Reference string access is therefore adaptive: a routine query can be answered faster than a non-routine one.
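A minimal sketch of this adaptive lookup follows. It assumes an index that maps each reference string to the set of record numbers containing it as a substring. The function name lookup, the three mode labels, and the final verification scan are illustrative assumptions, not details taken from the paper.

```python
def lookup(argument, index, records):
    """Answer a substring query using a reference string index.

    Returns (matching record ids, mode), where mode is
    "direct"     - the argument itself is a reference string,
    "decomposed" - shorter reference strings inside it narrowed the search,
    "scan"       - no reference string applies, so every record is checked.
    """
    arg = argument.lower()

    # Routine query: the posting list is the exact answer.
    if arg in index:
        return sorted(index[arg]), "direct"

    # Non-routine query: use every reference string contained in the argument.
    parts = [string for string in index if string in arg]
    if parts:
        # A record containing the argument contains all of its parts, so the
        # intersection is a superset of the true answer.
        candidates = set.intersection(*(index[string] for string in parts))
        mode = "decomposed"
    else:
        candidates = range(len(records))
        mode = "scan"

    # Verify the candidates against the stored records.
    hits = [rid for rid in sorted(candidates) if arg in records[rid].lower()]
    return hits, mode


if __name__ == "__main__":
    records = ["database system", "data base", "string index", "based on data"]
    index = {"data": {0, 1, 3}, "base": {0, 1, 3}, "in": {2}}
    for argument in ("data", "database", "xyz"):
        print(argument, lookup(argument, index, records))
```

The direct case touches only one posting list, while the decomposed case pays for an intersection plus a verification pass over the candidates. This is one way to read the claim that routine queries are answered faster than non-routine ones.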

The reference string index may be applied as a new adapted index in information retrieval systems as well as in formatted files, as a single- or multi-attribute index. In addition, it can be applied to phonetic and general record similarity search.

Keywords

Partial Match; Access Rate; Inverted List; Index List; Reference String

Copyright information

© Springer-Verlag Berlin Heidelberg 1978

Authors and Affiliations

  • H.-J. Schek (1)
  1. IBM Wissenschaftliches Zentrum, Heidelberg