Entropy-Based Static Index Pruning

  • Lei Zheng
  • Ingemar J. Cox
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5478)

Abstract

We propose a new entropy-based algorithm for static index pruning. The algorithm computes an importance score for each document in the collection based on the entropy of each term. A threshold is set according to the desired level of pruning, and all postings associated with documents that score below this threshold are removed from the index, i.e., the documents are removed from the collection. We compare this entropy-based approach with previous work by Carmel et al. [1] on both the Financial Times (FT) and Los Angeles Times (LA) collections. Experimental results reveal that the entropy-based approach has superior performance on the FT collection for both precision at 10 (P@10) and mean average precision (MAP). For the LA collection, however, Carmel’s method is generally superior in terms of MAP. The variation in performance across collections suggests that a hybrid algorithm incorporating elements of both methods might perform more stably across collections. A simple hybrid method is tested, in which the first 10% of pruning is performed using the entropy-based method and further pruning is performed by Carmel’s method. Experimental results show that the hybrid algorithm slightly improves on Carmel’s method, but performs significantly worse than the entropy-based method on the FT collection.
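
The document-level pruning described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so everything specific here is an assumption made for illustration: each term's entropy contribution is taken as -p(t) log2 p(t) with p(t) estimated from collection term frequencies, a document's importance score is the mean contribution of its tokens, and the function names (`term_entropy`, `document_scores`, `build_index`, `prune_index`) are hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter


def term_entropy(docs):
    """Per-term entropy contribution -p(t) * log2 p(t).

    ASSUMPTION: p(t) is the term's share of all tokens in the collection;
    the paper's actual estimate may differ.
    """
    counts = Counter(t for tokens in docs.values() for t in tokens)
    total = sum(counts.values())
    return {t: -(c / total) * math.log2(c / total) for t, c in counts.items()}


def document_scores(docs, entropy):
    """Hypothetical importance score: mean entropy contribution of a document's tokens."""
    return {
        doc_id: (sum(entropy[t] for t in tokens) / len(tokens) if tokens else 0.0)
        for doc_id, tokens in docs.items()
    }


def build_index(docs):
    """Build a simple inverted index: term -> list of document ids."""
    index = {}
    for doc_id, tokens in docs.items():
        for t in sorted(set(tokens)):
            index.setdefault(t, []).append(doc_id)
    return index


def prune_index(index, scores, prune_fraction):
    """Remove all postings of the lowest-scoring fraction of documents.

    Choosing the fraction fixes the score threshold implicitly: documents
    ranked below the cut-off are dropped from every posting list.
    """
    ranked = sorted(scores, key=scores.get)  # lowest score first
    removed = set(ranked[: int(len(ranked) * prune_fraction)])
    pruned = {}
    for term, postings in index.items():
        kept = [d for d in postings if d not in removed]
        if kept:  # drop terms whose posting list becomes empty
            pruned[term] = kept
    return pruned, removed
```

Calling `prune_index(build_index(docs), document_scores(docs, term_entropy(docs)), 0.1)` on a dictionary mapping document ids to token lists would correspond to a 10% pruning level under these assumptions. Pruning whole documents, rather than individual postings, is what distinguishes this scheme from the term-centric pruning of Carmel et al. [1].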

Keywords

Financial Time; Mean Average Precision; Index Table; Importance Score; Inverted Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

  1. Carmel, D., Cohen, D., Fagin, R., Farchi, E., Herscovici, M., Maarek, Y.S., Soffer, A.: Static index pruning for information retrieval systems. In: SIGIR, pp. 43–50 (2001)
  2. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
  3. Blanco, R., Barreiro, A.: Static pruning of terms in inverted files. In: Amati, G., Carpineto, C., Romano, G. (eds.) ECIR 2007. LNCS, vol. 4425, pp. 64–75. Springer, Heidelberg (2007)
  4. Büttcher, S., Clarke, C.L.A.: A document-centric approach to static index pruning in text retrieval systems. In: CIKM, pp. 182–189 (2006)
  5. Ogilvie, P., Callan, J.: Experiments using the Lemur toolkit. In: Proceedings of the Tenth Text Retrieval Conference, TREC-10 (2001)
  6. Krovetz, R.: Viewing morphology as an inference process. In: SIGIR, pp. 191–202 (1993)
  7. Fox, C.: A stop list for general text. SIGIR Forum 24(1-2), 19–21 (1990)
  8. Sparck-Jones, K., Walker, S., Robertson, S.E.: A probabilistic model of information retrieval. Information Processing and Management 36(6), 779–808 (2000)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Lei Zheng (1)
  • Ingemar J. Cox (1)
  1. University College London, Suffolk, United Kingdom
