Building a Microblog Corpus for Search Result Diversification

  • Ke Tao
  • Claudia Hauff
  • Geert-Jan Houben
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8281)


Queries that users pose to search engines are often ambiguous, either because different users express different intents with the same query terms or because the query is underspecified and it is unclear which aspect of the topic the user is interested in. In the Web search setting, search result diversification, which aims to produce a ranking that covers a range of query intents or, for underspecified queries, a range of aspects of a single topic, has been shown in recent years to be an effective strategy for satisfying search engine users. We hypothesize that such a strategy will also benefit search on microblogging platforms. Currently, progress in this direction is limited by the lack of a microblog-based diversification corpus. In this paper we address this shortcoming and present our work on creating such a corpus. We show that this corpus fulfils a number of diversification criteria described in the literature. We also report initial search and retrieval experiments that evaluate the benefits of de-duplication in the diversification setting.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ke Tao (1)
  • Claudia Hauff (1)
  • Geert-Jan Houben (1)
  1. Web Information Systems, TU Delft, Delft, The Netherlands
