Document Identifier Reassignment Through Dimensionality Reduction

  • Conference paper
Advances in Information Retrieval (ECIR 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3408)

Abstract

Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent work has demonstrated that IF sizes can be reduced by reassigning the document identifiers of the original collection, since this lowers the average distance between the documents related to a single term. Variable-bit encoding schemes can exploit the reduced average gap and decrease the total number of bits per document pointer. However, the approximations developed so far require great amounts of time or use an uncontrolled amount of memory. This paper presents an efficient solution to the reassignment problem that consists of reducing the dimensionality of the input data using an SVD transformation. We tested this approximation with the Greedy-NN TSP algorithm and with a more efficient variant based on dividing the original problem into sub-problems. We present experimental tests and performance results on two TREC collections, obtaining good compression ratios with low running times. We also show experimental results on the tradeoff between dimensionality reduction and compression, and on time performance.
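To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes Elias gamma coding as the variable-bit scheme, Jaccard similarity on raw term sets as the document similarity, and a six-document toy collection; all function names and the data are hypothetical. The SVD step is omitted: in the paper the nearest-neighbour search would run on the dimensionality-reduced document vectors rather than on the raw term sets used here.

```python
def gamma_bits(gap):
    """Bits used by the Elias gamma code for a positive integer gap."""
    return 2 * gap.bit_length() - 1

def build_postings(docs, order):
    """Build inverted lists; the document at position p of `order` gets id p + 1."""
    new_id = {doc: pos + 1 for pos, doc in enumerate(order)}
    postings = {}
    for doc, terms in enumerate(docs):
        for term in terms:
            postings.setdefault(term, []).append(new_id[doc])
    return postings

def posting_bits(postings):
    """Total gamma-coded size, storing each list as gaps between sorted ids."""
    total = 0
    for doc_ids in postings.values():
        prev = 0
        for d in sorted(doc_ids):
            total += gamma_bits(d - prev)  # first id is coded as a gap from 0
            prev = d
    return total

def jaccard(a, b):
    """Jaccard similarity of two term sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_nn_order(docs, start=0):
    """Greedy nearest-neighbour tour: always visit the most similar unvisited doc."""
    order = [start]
    remaining = set(range(len(docs))) - {start}
    while remaining:
        last = docs[order[-1]]
        # sorted() makes ties deterministic: the lowest index wins
        nxt = max(sorted(remaining), key=lambda j: jaccard(last, docs[j]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy collection (hypothetical): two topics interleaved in the original order.
docs = [
    {"svd", "matrix"},
    {"index", "gap"},
    {"svd", "matrix", "rank"},
    {"index", "gap", "code"},
    {"svd", "rank"},
    {"index", "code"},
]

original_order = list(range(len(docs)))
new_order = greedy_nn_order(docs)

bits_before = posting_bits(build_postings(docs, original_order))
bits_after = posting_bits(build_postings(docs, new_order))
print(new_order, bits_before, bits_after)
```

In this toy case the greedy tour places similar documents next to each other, so the gaps inside each inverted list shrink and the gamma-coded total drops. The quadratic cost of the nearest-neighbour search is what motivates working on reduced vectors and on sub-problems.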

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Blanco, R., Barreiro, Á. (2005). Document Identifier Reassignment Through Dimensionality Reduction. In: Losada, D.E., Fernández-Luna, J.M. (eds) Advances in Information Retrieval. ECIR 2005. Lecture Notes in Computer Science, vol 3408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31865-1_27

  • DOI: https://doi.org/10.1007/978-3-540-31865-1_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25295-5

  • Online ISBN: 978-3-540-31865-1

  • eBook Packages: Computer Science, Computer Science (R0)
