Sorting Out the Document Identifier Assignment Problem

  • Fabrizio Silvestri
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4425)

Abstract

The compression of inverted file indexes in Web search engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves overall retrieval performance, since it allows better exploitation of the memory hierarchy. In this paper we show empirically that, for collections of Web documents, the performance of compression algorithms can be enhanced simply by assigning identifiers to documents according to the lexicographical ordering of their URLs. We validate this hypothesis by comparing several assignment techniques and several compression algorithms on a fairly large collection of about six million documents. The results are very encouraging: the compression ratio improves by up to 40% using an algorithm that takes about ninety seconds to finish and uses only 100 MB of main memory.
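
The idea of assigning identifiers by URL order can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names are hypothetical, and the use of gap-encoded posting lists with a variable-byte code is an assumption chosen only because it is a common way to store inverted lists. The sketch sorts the URLs, uses each URL's rank as its document identifier, and estimates how many bytes a posting list would occupy.

```python
from typing import Dict, Iterable, List


def assign_ids_by_url(urls: Iterable[str]) -> Dict[str, int]:
    """Give each URL an identifier equal to its rank in lexicographic order."""
    return {url: rank for rank, url in enumerate(sorted(urls))}


def d_gaps(doc_ids: Iterable[int]) -> List[int]:
    """Turn a set of document identifiers into the first id followed by gaps."""
    ordered = sorted(doc_ids)
    if not ordered:
        return []
    return [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]


def vbyte_len(n: int) -> int:
    """Bytes used by a variable-byte code with 7 payload bits per byte (assumed codec)."""
    size = 1
    while n >= 128:
        n >>= 7
        size += 1
    return size


def posting_list_bytes(urls_with_term: Iterable[str], id_of: Dict[str, int]) -> int:
    """Estimated size of one gap-encoded posting list under the assumed codec."""
    gaps = d_gaps(id_of[url] for url in urls_with_term)
    return sum(vbyte_len(gap) for gap in gaps)


if __name__ == "__main__":
    # Toy collection; the URLs are made up for illustration.
    urls = ["http://b.org/x", "http://a.org/2", "http://a.org/1"]
    id_of = assign_ids_by_url(urls)
    print(id_of)
    print(posting_list_bytes(["http://a.org/1", "http://a.org/2"], id_of))
```

Pages from the same site tend to share vocabulary, so sorting by URL places documents that contain the same terms at nearby identifiers. Under that assumption the gaps in each posting list shrink, and smaller gaps need fewer bytes in any code that spends fewer bits on smaller numbers.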

Keywords

Compression Ratio; Assignment Problem; Compression Algorithm; External Memory; Assignment Algorithm

Copyright information

© Springer Berlin Heidelberg 2007

Authors and Affiliations

  • Fabrizio Silvestri
  1. Institute for Information Science and Technologies, ISTI - CNR, via Moruzzi, 1, 56126 Pisa, Italy
