Journal of Computer Science and Technology, Volume 31, Issue 1, pp 147–166

AS-Index: A Structure for String Search Using n-Grams and Algebraic Signatures

  • Camelia Constantin
  • Cédric du Mouza
  • Witold Litwin
  • Philippe Rigaux
  • Thomas Schwarz
Regular Paper

Abstract

We present the AS-Index, a new index structure for exact string search in disk-resident databases. The AS-Index relies on a classical inverted file structure; its main innovation is a probabilistic search based on the properties of algebraic signatures, which are used for both n-gram hashing and pattern search. Specifically, the properties of our signatures allow a search to be carried out by inspecting only two of the posting lists. The algorithm thus enjoys the unique feature of requiring a constant number of disk accesses, independent of both the pattern size and the database size. We conduct extensive experiments on large datasets to evaluate the behavior of our index. They confirm that it steadily provides a search performance proportional to the two disk accesses necessary to obtain the posting lists. This makes our structure a choice of interest for the class of applications that require very fast lookups in large textual databases. We describe the index structure, our use of algebraic signatures, and the search algorithm. We discuss the operational trade-offs based on the parameters that affect the behavior of our structure, and present the theoretical and experimental performance analysis. We then compare the AS-Index with state-of-the-art alternatives and show that 1) its construction time matches that of its competitors, owing to the similarity of the structures, and 2) its search time consistently outperforms the standard approach, thanks to the economical access to data complemented by signature calculations, which is at the core of our search method.
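
The abstract does not spell out the signature scheme, so the following is only a minimal sketch of the general idea, not the AS-Index implementation. It assumes a one-symbol algebraic signature over GF(2^8), defined as the XOR-sum of each byte multiplied by a power of a primitive element, and it assumes that cumulative signatures of the text are available at the positions being compared. The field polynomial, the names `cumulative_signatures` and `candidate_match`, and the in-memory layout are all illustrative choices. The point it shows is that a window of the text can be tested against a pattern from two stored values alone, without reading the characters in between.

```python
# Illustrative sketch only: GF(2^8) arithmetic and cumulative algebraic signatures.
# The polynomial, symbol size, and function names are assumptions, not the paper's.
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)
ORDER = 255        # number of nonzero elements in GF(2^8)


def build_tables():
    """Build exp/log tables for the primitive element alpha = 2."""
    exp = [0] * (2 * ORDER)
    log = [0] * 256
    x = 1
    for i in range(ORDER):
        exp[i] = x
        log[x] = i
        x <<= 1                 # multiply by alpha
        if x & 0x100:
            x ^= PRIM_POLY      # reduce modulo the field polynomial
    for i in range(ORDER, 2 * ORDER):
        exp[i] = exp[i - ORDER]  # duplicate so gf_mul needs no modulo
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def alpha_pow(k):
    """Return alpha^k."""
    return EXP[k % ORDER]


def cumulative_signatures(data: bytes):
    """cs[i] is the XOR-sum of data[j] * alpha^j for all j < i (cs[0] = 0)."""
    cs = [0]
    for j, byte in enumerate(data):
        cs.append(cs[-1] ^ gf_mul(byte, alpha_pow(j)))
    return cs


def signature(data: bytes):
    """Signature of a whole string, with positions counted from 0."""
    return cumulative_signatures(data)[-1]


def candidate_match(cs, start, pattern_sig, pattern_len):
    """Probabilistic test: does the window at `start` have the pattern's signature?

    cs[start + len] ^ cs[start] equals alpha^start * sig(window), so only two
    cumulative values are needed, whatever the pattern length.
    """
    window = cs[start + pattern_len] ^ cs[start]
    return window == gf_mul(alpha_pow(start), pattern_sig)


text = b"the quick brown fox"
pattern = b"brown"
cs = cumulative_signatures(text)
start = text.index(pattern)
print(candidate_match(cs, start, signature(pattern), len(pattern)))
```

With a single 8-bit symbol, an unrelated window collides with the pattern's signature with probability about 1/256, which is why a test of this kind is probabilistic; wider signatures reduce that probability.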

Keywords

full text indexing; large-scale indexing; algebraic signature

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Camelia Constantin (1)
  • Cédric du Mouza (2)
  • Witold Litwin (3)
  • Philippe Rigaux (2)
  • Thomas Schwarz (4)

  1. LIP6 Laboratory, University Pierre et Marie Curie, Paris, France
  2. CEDRIC Laboratory, Conservatoire National des Arts et Métiers, Paris, France
  3. LAMSADE Laboratory, University Paris-Dauphine, Paris, France
  4. DICC Laboratory, Universidad Católica del Uruguay, Montevideo, Uruguay
