
Learning to rank with (a lot of) word features


In this article we present Supervised Semantic Indexing, which defines a class of nonlinear (quadratic) models that are discriminatively trained to map directly from the word content of a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI, our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models over all pairs of word features is computationally challenging. We propose several improvements to our basic model to address this issue, including low-rank (but diagonal-preserving) representations, correlated feature hashing, and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
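To make the low-rank, diagonal-preserving idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the score has the form \(f(q,d) = q^\top W d\) with \(W = U^\top V + I\) (the notes to this article mention a model \(W=U^\top V\) with the identity), and that q and d are sparse bag-of-words vectors. The names `embed`, `score`, `dense_score`, and the toy matrices are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

SparseVec = Dict[int, float]  # word index -> weight (for example TF-IDF)
Matrix = List[List[float]]    # N rows of length D (N latent dims, D words)


def embed(M: Matrix, x: SparseVec) -> List[float]:
    """Return M x for a sparse x, touching only the nonzero words of x."""
    return [sum(row[j] * w for j, w in x.items()) for row in M]


def score(U: Matrix, V: Matrix, q: SparseVec, d: SparseVec) -> float:
    """Assumed form: f(q, d) = q^T (U^T V + I) d = (U q) . (V d) + q . d."""
    uq = embed(U, q)
    vd = embed(V, d)
    low_rank = sum(a * b for a, b in zip(uq, vd))
    # The identity term keeps exact word matches between q and d.
    diagonal = sum(w * d[j] for j, w in q.items() if j in d)
    return low_rank + diagonal


def dense_score(U: Matrix, V: Matrix, q: SparseVec, d: SparseVec) -> float:
    """Reference check: form each entry W[i][j] explicitly (all word pairs)."""
    total = 0.0
    for i, qi in q.items():
        for j, dj in d.items():
            w_ij = sum(U[k][i] * V[k][j] for k in range(len(U)))
            if i == j:
                w_ij += 1.0
            total += qi * w_ij * dj
    return total


# Toy example with D = 4 words and N = 2 latent dimensions (made-up values).
U = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, 0.5]]
V = [[1.0, 0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 2.0]]
q = {0: 1.0, 2: 2.0}
d = {0: 0.5, 3: 1.0}

print(score(U, V, q, d))        # low-rank path
print(dense_score(U, V, q, d))  # explicit all-pairs path, same value
```

The factored path never materializes the D × D matrix W: it stores only the two N × D factors and does work proportional to the number of nonzero words in q and d, which is why a low-rank form is attractive when the model ranges over all pairs of word features.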



  1. This work is an expanded version of a poster paper (Bai et al. 2009) with further algorithmic proposals, applications and experiments.

  2. In fact, our resulting methods do not require q and d to have the same dimensionality \(\mathcal{D}\), but we make this assumption for simplicity of exposition.

  3. Of course, any method can be sped up by applying it only to a subset of documents pre-filtered using some faster method.

  5. For example, Google provides such a service at http://translate.google.com/translate_s.

  8. Oral presentation at the (Snowbird) Machine Learning Workshop, see http://snowbird.djvuzone.org/abstracts/119.pdf.

  10. We use the SVDLIBC software (http://tedlab.mit.edu/~dr/svdlibc/) and the cosine distance in the latent concept space.

  11. We removed links to calendar years as they provide little information while being very frequent.

  12. Note that the model \(W=U^\top V\) with the identity achieved a ranking loss of 0.56%; however, this model can represent at least some of the diagonal.


  1. Baeza-Yates, R., & Ribeiro-Neto, B., et al. (1999). Modern information retrieval. Harlow, England: Addison-Wesley.

  2. Bai, B., Weston, J., Collobert, R., & Grangier, D. (2009). Supervised semantic indexing. In European conference on information retrieval.

  3. Berger, A., & Lafferty, J. (1999). Information retrieval as statistical translation. In ACM SIGIR ’99 (pp. 222–229).

  4. Blei, D. M., & McAuliffe, J. D. (2007). Supervised topic models. In Advances in neural information processing systems (NIPS).

  5. Blei, D. M., Ng, A., & Jordan, M. I. (2003). Latent dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.

  6. Bunescu, R., & Pasca, M. (2006). Using encyclopedic knowledge for named entity disambiguation. In EACL (pp. 9–16).

  7. Burges, C., Ragno, R., & Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In Advances in neural information processing systems: Proceedings of the 2006 conference. Cambridge, MA: MIT Press.

  8. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., et al. (2005). Learning to rank using gradient descent. In ICML 2005 (pp. 89–96). New York: ACM Press.

  9. Cao, Z., Qin, T., Liu, T. Y., Tsai, M. F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning (pp. 129–136). New York: ACM Press.

  10. Caruana, R., Lawrence, S., & Giles, L. (2000). Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in neural information processing systems, NIPS 13 (pp. 402–408).

  11. Chernov, S., Iofciu, T., Nejdl, W., & Zhou, X. (2006). Extracting semantic relationships between wikipedia categories. In 1st international workshop: SemWiki2006—From Wiki to semantics (SemWiki 2006), co-located with the ESWC2006 in Budva.

  12. Collins, M., & Duffy, N. (2001). New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th annual meeting on association for computational linguistics (pp. 263–270). Morristown, NJ: Association for Computational Linguistics.

  13. Collobert, R., & Bengio, S. (2004). Links between perceptrons, mlps and svms. In ICML 2004.

  14. Cucerzan, S. (2007). Large-scale named entity disambiguation based on wikipedia data. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (pp. 708–716). Prague: Association for Computational Linguistics.

  15. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. JASIS, 41(6), 391–407.

  16. Dumais, S. T., Letsche, T. A., Littman, M. L., & Landauer, T. K. (1997). Automatic cross-language retrieval using latent semantic indexing. In AAAI spring symposium on cross-language text and speech retrieval.

  17. Gabrilovich, E. & Markovitch, S. (2007). Computing semantic relatedness using wikipedia-based explicit semantic analysis. In International joint conference on artificial intelligence.

  18. Gehler, P., Holub, A., & Welling, M. (2006). The rate adapting poisson (rap) model for information retrieval and object recognition. In Proceedings of the 23rd international conference on machine learning.

  19. Globerson, A., & Roweis, S. (2007). Visualizing pairwise similarity via semidefinite programming. In AISTATS.

  20. Goel, S., Langford, J., & Strehl, A. (2009). Predictive indexing for fast search. In Advances in neural information processing systems 21.

  21. Grangier, D., & Bengio, S. (2005). Inferring document similarity from hyperlinks. In CIKM ’05 (pp. 359–360). New York: ACM.

  22. Grangier, D., & Bengio, S. (2008). A discriminative kernel-based approach to rank images from text queries. IEEE Transactions on PAMI, 30(8), 1371–1384.

  23. Grefenstette, G. (1998). Cross-language information retrieval. Norwell, MA: Kluwer Academic Publishers.

  24. Guyon, I. M., Gunn, S. R., Nikravesh, M., & Zadeh, L. (Eds.). (2006). Feature extraction: Foundations and applications. Berlin: Springer.

  25. Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. Cambridge, MA: MIT Press.

  26. Hofmann, T. (1999). Probabilistic latent semantic indexing. In SIGIR 1999 (pp. 50–57). New York: ACM Press.

  27. Hu, J., Fang, L., Cao, Y., Zeng, H., Li, H., Yang, Q., et al. (2008). Enhancing text clustering by leveraging wikipedia semantics. In SIGIR ’08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval (pp. 179–186). New York: ACM.

  28. Jain, P., Kulis, B., Dhillon, I. S., & Grauman, K. (2008). Online metric learning and fast similarity search. In Advances in neural information processing systems (NIPS).

  29. Joachims, T. (2002). Optimizing search engines using clickthrough data. In ACM SIGKDD (pp. 133–142).

  30. Keller, M., & Bengio, S. (2005). A neural network for text representation. In International conference on artificial neural networks, ICANN, IDIAP-RR 05-12.

  31. Langford, J., Li, L., & Zhang, T. (2009). Sparse online learning via truncated gradient. In Advances in neural information processing systems 21.

  32. Liu, T. Y., Xu, J., Qin, T., Xiong, W., & Li, H. (2007). Letor: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of SIGIR 2007 workshop on learning to rank for information retrieval.

  33. Milne, D. N., Witten, I. H., & Nichols, D. M. (2007). A knowledge-based search engine powered by wikipedia. In CIKM ’07: Proceedings of the sixteenth ACM conference on conference on information and knowledge management (pp. 445–454). New York: ACM.

  34. Minier, Z., Bodo, Z., & Csato, L. (2007). Wikipedia-based kernels for text categorization. In 9th international symposium on symbolic and numeric algorithms for scientific computing (pp. 157–164).

  35. Ruiz-Casado, M., Alfonseca, E., & Castells, P. (2005). Automatic extraction of semantic relationships for wordnet by means of pattern learning from wikipedia. In NLDB (pp. 67–79). Berlin: Springer.

  36. Salakhutdinov, R., & Hinton, G. (2007). Semantic hashing. In Proceedings of the SIGIR workshop on information retrieval and applications of graphical models, Amsterdam.

  37. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash kernels. In Twelfth international conference on artificial intelligence and statistics.

  38. Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38.

  39. Sun, J., Chen, Z., Zeng, H., Lu, Y., Shi, C., & Ma, W. (2004). Supervised latent semantic indexing for document categorization. In ICDM 2004 (pp. 535–538). Washington, DC: IEEE Computer Society.

  40. Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2003). Inferring a semantic representation of text via cross-language correlation analysis. In NIPS (pp. 1497–1504).

  41. Voorhees, E. M., & Dang, H. T. (2005). Overview of the trec 2005 question answering track. In TREC 2005.

  42. Wang, X., Sun, J., Chen, Z., & Zhai, C. (2006). Latent semantic analysis for multiple-type interrelated data objects. In SIGIR ’06.

  43. Weinberger, K., & Saul, L. (2008). Fast solvers and efficient implementations for distance metric learning. In International conference on machine learning.

  44. Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In SIGIR (pp. 271–278).

  45. Zighelnic, L., & Kurland, O. (2008). Query-drift prevention for robust query expansion. In SIGIR 2008 (pp. 825–826). New York: ACM.


Author information

Correspondence to Bing Bai.


About this article

Cite this article

Bai, B., Weston, J., Grangier, D. et al. Learning to rank with (a lot of) word features. Inf Retrieval 13, 291–314 (2010). https://doi.org/10.1007/s10791-009-9117-9



  • Semantic indexing
  • Feature hashing
  • Learning to rank
  • Cross language retrieval
  • Content matching