
Yet Another Sorting-Based Solution to the Reassignment of Document Identifiers

  • Liang Shi
  • Bin Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7675)

Abstract

Inverted files are widely used in search engines, such as web search and library search. Previous work has demonstrated that the compressed size of an inverted file can be significantly reduced by reassigning document identifiers. There are two main state-of-the-art solutions: the URL sorting-based solution, which sorts the documents in alphabetical order of their URLs; and the TSP-based solution, which treats the reassignment as a Traveling Salesman Problem. These techniques achieve good compression, but the former requires that documents have URLs and the latter is limited in the data size it can handle. In this paper, we propose an efficient solution to the reassignment problem that first sorts the terms in each document by document frequency and then sorts the documents by the presence of the terms. Our approach places few restrictions on data sets and is applicable to various situations. Experimental results on four public data sets show that, compared with the TSP-based approach, our approach reduces the time complexity from \(O(n^2)\) to \(O(\overline{|D|} \cdot n\log n)\) (\(\overline{|D|}\): average length of the n documents) while achieving a comparable compression ratio; and, compared with the URL sorting-based approach, our approach improves the compression ratio by up to 10.6% with approximately the same run time.
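The two sorting steps named in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not say in which direction terms are ordered or how ties are broken, so the sketch assumes that terms are ranked by descending document frequency (ties broken alphabetically) and that documents are compared lexicographically by their sorted term ranks. The function names `term_sort_order`, `postings`, and `total_gap` are hypothetical, and the sum of d-gaps is used only as a crude stand-in for compressed size.

```python
from collections import Counter, defaultdict


def term_sort_order(docs):
    """Return the old document IDs in their new order.

    docs is a list of term lists; the list position is the old document ID.
    """
    # Document frequency: the number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    # ASSUMPTION: most frequent terms get the smallest rank,
    # with alphabetical order as the tie-breaker.
    ranked_terms = sorted(df, key=lambda t: (-df[t], t))
    rank = {t: r for r, t in enumerate(ranked_terms)}
    # Step 1: inside each document, sort the terms by their rank.
    keys = [sorted(rank[t] for t in set(doc)) for doc in docs]
    # Step 2: sort the documents lexicographically by those rank sequences.
    # Each comparison touches at most the length of a document, which is
    # where a cost proportional to average length times n log n comes from.
    return sorted(range(len(docs)), key=lambda d: keys[d])


def postings(docs, order):
    """Build term -> sorted list of new document IDs for a given order."""
    new_id = {old: new for new, old in enumerate(order)}
    index = defaultdict(list)
    for old, doc in enumerate(docs):
        for t in set(doc):
            index[t].append(new_id[old])
    return {t: sorted(ids) for t, ids in index.items()}


def total_gap(index):
    """Sum of the gaps between consecutive IDs over all posting lists."""
    return sum(b - a for ids in index.values() for a, b in zip(ids, ids[1:]))


if __name__ == "__main__":
    # Toy collection (made up for illustration only).
    docs = [
        ["apple", "pie"],
        ["linux", "kernel"],
        ["apple", "tart"],
        ["linux", "driver"],
    ]
    order = term_sort_order(docs)
    before = total_gap(postings(docs, list(range(len(docs)))))
    after = total_gap(postings(docs, order))
    print(order, before, after)
```

On this toy collection the two documents about "apple" end up adjacent, as do the two about "linux", so the gaps in both posting lists shrink from 2 to 1. Smaller gaps are what gap-based index compressors exploit, which is the effect the abstract attributes to identifier reassignment.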

Keywords

Inverted File, Index Compression, Reassignment of Document Identifiers, TERM sorting-based



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Liang Shi (1, 2)
  • Bin Wang (1)
  1. Institute of Computing Technology, Chinese Academy of Sciences, China
  2. Graduate School of the Chinese Academy of Sciences, Beijing, China
