International Symposium on String Processing and Information Retrieval

SPIRE 2015: String Processing and Information Retrieval, pp. 1-12

Faster Exact Search Using Document Clustering

  • Jonathan Dimond
  • Peter Sanders
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9309)


We show how full-text search based on inverted indices can be accelerated by clustering the documents without losing results (SeCluD – Search with Clustered Documents). We develop a fast multilevel clustering algorithm that uses the query cost of conjunctive queries as its objective function. Depending on the input, clustered search is up to four times faster than non-clustered search. The resulting clusters are also useful for data compression and for distributing the work over many machines.
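To make the idea of exact search over clustered documents concrete, the following is a minimal sketch and not the authors' implementation. It assumes a two-level layout in which a cluster index maps each term to the clusters containing it and each cluster keeps its own postings. The function names `build_indices` and `conjunctive_query`, the whitespace tokenizer, and the precomputed `cluster_of` assignment are all assumptions made for illustration; the multilevel clustering algorithm itself is not shown.

```python
from collections import defaultdict


def build_indices(docs, cluster_of):
    """Build a two-level index (hypothetical layout, for illustration only).

    docs:       {doc_id: text}
    cluster_of: {doc_id: cluster_id}, assumed to come from some clustering step
    """
    cluster_index = defaultdict(set)  # term -> ids of clusters containing the term
    postings = defaultdict(set)       # (cluster_id, term) -> doc ids in that cluster
    for doc_id, text in docs.items():
        cluster = cluster_of[doc_id]
        for term in set(text.lower().split()):
            cluster_index[term].add(cluster)
            postings[(cluster, term)].add(doc_id)
    return cluster_index, postings


def conjunctive_query(terms, cluster_index, postings):
    """Return all documents containing every query term (AND semantics)."""
    terms = [t.lower() for t in terms]
    if not terms:
        return set()
    # Step 1: keep only clusters in which every query term occurs somewhere.
    candidates = set.intersection(*(cluster_index.get(t, set()) for t in terms))
    # Step 2: intersect the per-cluster posting lists of the surviving clusters.
    result = set()
    for cluster in candidates:
        result |= set.intersection(*(postings[(cluster, t)] for t in terms))
    return result
```

Under these assumptions the result set is the same as for a single flat inverted index, because a document can only match if its cluster contains every query term. Any saving comes from skipping clusters in step 1, so it depends on how well the clustering keeps terms that are queried together in the same clusters.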


Cluster Algorithm, Document Cluster, Inverted Index, Conjunctive Query, Cluster Index
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Karlsruhe Institute of Technology, Karlsruhe, Germany
