Advertisement

The VLDB Journal

, Volume 17, Issue 6, pp 1485–1507 | Cite as

Structural optimization of a full-text n-gram index using relational normalization

  • Min-Soo KimEmail author
  • Kyu-Young Whang
  • Jae-Gil Lee
  • Min-Jae Lee
Regular Paper

Abstract

As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval or in approximate string matching due to its two major advantages: language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: the size tends to be very large, and the performance of queries tends to be bad. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance by using the relational normalization theory. We first identify that, in the (full-text) n-gram index, there exists redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates such redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1 GB show that the size of the n-gram/2L index is reduced by up to 1.9–2.4 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram index. We also compare the n-gram/2L index with Makinen’s compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching pp. 305–319, 2000) stored in disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15–20), and the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15–20).

Keywords

Text search Inverted index n-gram Multivalued dependency 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Baeza-Yates, R., Navarro, G.: A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1(2), (1998)Google Scholar
  2. 2.
    Baeza-Yates, R., Navarro, G.: Block addressing indices for approximate text retrieval. J. Am. Soc. Inf. Sci. 51(1), 69–82 (2000)CrossRefGoogle Scholar
  3. 3.
    Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press (1999)Google Scholar
  4. 4.
    Barroso, L.A., Dean, J., Holzle, U.: Web search for a planet: the google cluster architecture. IEEE Micro 23(2), 22–28 (2003)CrossRefGoogle Scholar
  5. 5.
    Cao, X., Li, S.C., Tung, A.K.H.: Indexing DNA sequences using q-grams. In: Proc. Int’l Conf. on Database Systems for Advanced Applications (DASFAA). Beijing, pp. 4–16 (2005)Google Scholar
  6. 6.
    Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 4th edn. Addison Wesley (2003)Google Scholar
  7. 7.
    Gao, J., Goodman, J., Li, M., Lee, K.: Toward a unified approach to statistical language modeling for Chinese. ACM Trans. Asian Lang. Inf. Process. (TALIP) 1(1), 3–33 (2002)CrossRefGoogle Scholar
  8. 8.
    Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proc. 32nd ACM Symposium on Theory of Computing (STOC), pp. 397–406 (2000)Google Scholar
  9. 9.
    Karkkainen, J., Rao, S.: 7. Full-text indexes in external memory. In: Algorithms for Memory Hierarchies pp. 149–170 (2003)Google Scholar
  10. 10.
    Karkkainen, J., Sutinen, E.: Lempel-Ziv index for q-grams. Algorithmica 21(1), 137–154 (1998)CrossRefMathSciNetGoogle Scholar
  11. 11.
    Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string mathcing. In: Proc. 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996)Google Scholar
  12. 12.
    Kim, M., Whang, K., Lee, J.: n-gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching. J. Comput. Systems Sci. Eng. (2007) (to appear)Google Scholar
  13. 13.
    Kim, M., Whang, K., Lee, J., Lee, M.: n-Gram/2L: a space and time efficient two-level n-gram inverted index structure. In: Proc. the 31th Int’l Conf. on Very Large Data Bases (VLDB), Trondheim, pp. 325–336 (2005)Google Scholar
  14. 14.
    Kukich, K.: Techniques for automatically correcting words in text. ACM Comput Surv 24(4), 377–439 (1992)CrossRefGoogle Scholar
  15. 15.
    Lee, J.H., Ahn J.S.: Using n-grams for korean text retrieval. In: Proc. Int’l Conf. on Information Retrieval. ACM SIGIR, Zurich, pp. 216–224 (1996)Google Scholar
  16. 16.
    Lehtinen, O., Sutinen, E., Tarhio, J.: Experiments on block indexing. In: Proc. 3rd South American Workshop on String Processing pp. 183–193 (1996)Google Scholar
  17. 17.
    Makinen, V.: Compact suffix array. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 305–319 (2000)Google Scholar
  18. 18.
    Mayfield, J., McNamee, P.: Single N-gram stemming. In: Proc. Int’l Conf. on Information Retrieval. ACM SIGIR, Toronto, pp. 415–416 (2003)Google Scholar
  19. 19.
    Melnik, S., Raghavan, S., Yang, B., Garcia-Molina, H.: Building a distributed full-text index for the Web. ACM Trans. Inf. Systems 19(3), 217–241 (2001)CrossRefGoogle Scholar
  20. 20.
    Miller, E., Shen, D., Liu, J., Nicholas, C.: Performance and scalability of a large-scale N-gram based information retrieval system. J. Digital Inf. 1(5), 1–25 (2000)Google Scholar
  21. 21.
    Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng Bull 24(4), 19–27 (2001)Google Scholar
  22. 22.
    Navarro, G., Makinen, V.: Compressed full-text indexes. Technical report TR/DCC-2006-6, Department of Computer Science, University of Chile, (2006). (accepted to ACM Computing Surveys)Google Scholar
  23. 23.
    Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 350–363 (2000)Google Scholar
  24. 24.
    Puglisi, S., Smyth, W., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Proc. 13th Symposium on String Processing and Information Retrieval (SPIRE), Glasgow, pp. 122–133 (2006)Google Scholar
  25. 25.
    Ramakrishnan, R.: Database Management Systems. McGraw-Hill, New York (1998)Google Scholar
  26. 26.
    Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proc. Int’l Conf. on Information Retrieval, ACM SIGIR, Tampere, pp. 222–229 (2002)Google Scholar
  27. 27.
    Sutinen, E., Tarhio, J.: Filtration with q-samples in approximate string matching. In: Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 50–63 (1996)Google Scholar
  28. 28.
    Ullman, J.D.: Principles of Database and Knowledge-Base Systems, Vol. I. Computer Science Press, USA (1988)Google Scholar
  29. 29.
    Whang, K., Lee, M., Lee, J., Kim, M., Han, W.: Odysseus:a high-performance ORDBMS tightly-coupled with IR features. In: Proc. 21st IEEE Int’l Conf. on Data Engineering (ICDE), Tokyo, pp. 1104–1105, (2005) (this paper received the Best Demonstration Award)Google Scholar
  30. 30.
    Williams, H.E.: Genomic information retrieval. In: Proc. 14th Australasian Database Conferences, Adelaide, pp. 27–35 (2003)Google Scholar
  31. 31.
    Williams, H.E., Zobel, J.: Indexing and retrieval for genomic databases. IEEE Trans. Knowl. Data Eng. 14(1), 63–78 (2002)CrossRefGoogle Scholar
  32. 32.
    Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn., Morgan Kaufmann (1999)Google Scholar
  33. 33.
    Yasushi, O., Masajirou, I.: A new character-based indexing method using frequency data for Japanese documents. In: Proc. Int’l Conf. on Information Retrieval, pp. 121–129. ACM SIGIR, Seattle (1995)Google Scholar
  34. 34.
    Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput Surv 38(2), (2006)Google Scholar

Copyright information

© Springer-Verlag 2007

Authors and Affiliations

  • Min-Soo Kim
    • 1
    Email author
  • Kyu-Young Whang
    • 1
  • Jae-Gil Lee
    • 1
  • Min-Jae Lee
    • 1
  1. 1.Department of Computer ScienceKorea Advanced Institute of Science and Technology (KAIST)DaejeonSouth Korea

Personalised recommendations