Topic Set Size Design with the Evaluation Measures for Short Text Conversation

  • Tetsuya Sakai (email author)
  • Lifeng Shang
  • Zhengdong Lu
  • Hang Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9460)


Short Text Conversation (STC) is a new NTCIR task which tackles the following research question: given a microblog repository and a new post to that microblog, can systems reuse an old comment from the repository to satisfy the author of the new post? The official evaluation measures of STC are normalised gain at 1 (nG@1), normalised expected reciprocal rank at 10 (nERR@10), and P\(^+\), all of which can be regarded as evaluation measures for navigational intents. In this study, we apply the topic set size design technique of Sakai to decide on the number of test topics, using variance estimates of the above evaluation measures. Our main conclusion is to create 100 test topics, but what distinguishes our work from other tasks with similar topic set sizes is that we know what this topic set size means from a statistical viewpoint for each of our evaluation measures. We also demonstrate that, under the same set of statistical requirements, the topic set sizes required by nERR@10 and P\(^+\) are more or less the same, while nG@1 requires more than twice as many topics. To our knowledge, our task is the first among all efforts at TREC-like evaluation conferences to actually create a new test collection by using this principled approach.
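The core idea behind topic set size design is a power calculation: given an estimated per-topic score variance and a minimum between-system difference one wants a paired significance test to detect, solve for the smallest topic set size. The sketch below uses the simple normal approximation to the paired t-test; Sakai's actual procedure refines this iteratively with the noncentral t distribution, so exact values differ slightly. The function name and the variance and difference values are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def topic_set_size(var_hat: float, min_diff: float,
                   alpha: float = 0.05, beta: float = 0.20) -> int:
    """Smallest n such that a two-sided paired t-test at significance
    level alpha has power (1 - beta) to detect a between-system score
    difference of min_diff, given var_hat, the estimated per-topic
    variance of the score differences (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(1 - beta)        # value achieving the required power
    return math.ceil((z_alpha + z_beta) ** 2 * var_hat / min_diff ** 2)


# Hypothetical inputs: a variance estimate of 0.10 for an nERR@10-like
# measure, and a minimum detectable difference of 0.10.
print(topic_set_size(0.10, 0.10))  # 79 topics under these assumptions
```

Note how the required size scales linearly with the variance estimate, which is why a noisier measure such as nG@1 (a binary-like measure at rank 1) demands a larger topic set than nERR@10 or P\(^+\) under identical statistical requirements.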


Keywords: Evaluation Measure · Relevant Document · Test Topic · Question Answering · Test Collection
(These keywords were machine-generated, not provided by the authors.)


References

  1. Broder, A.: A taxonomy of web search. SIGIR Forum 36(2), 3–10 (2002)
  2. Chapelle, O., Ji, S., Liao, C., Velipasaoglu, E., Lai, L., Wu, S.L.: Intent-based diversification of web search results: metrics and algorithms. Inf. Retrieval 14(6), 572–592 (2011)
  3. Ellis, P.D.: The Essential Guide to Effect Sizes. Cambridge University Press, New York (2010)
  4. Higashinaka, R., Kawamae, N., Sadamitsu, K., Minami, Y., Meguro, T., Dohsaka, K., Inagaki, H.: Building a conversational model from two-tweets. In: Proceedings of IEEE ASRU 2011 (2011)
  5. Jafarpour, S., Burges, C.J.: Filter, rank and transfer the knowledge: learning to chat. Technical report, MSR-TR-2010-93 (2010)
  6. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of IR techniques. ACM TOIS 20(4), 422–446 (2002)
  7. Lin, J., Efron, M.: Overview of the TREC-2013 microblog track. In: Proceedings of TREC 2013 (2014)
  8. Mitamura, T., Shima, H., Sakai, T., Kando, N., Mori, T., Takeda, K., Lin, C.Y., Song, R., Lin, C.J., Lee, C.W.: Overview of the NTCIR-8 ACLIA tasks: advanced cross-lingual information access. In: Proceedings of NTCIR-8, pp. 15–24 (2010)
  9. Nagata, Y.: How to Design the Sample Size (in Japanese). Asakura Shoten (2003)
  10. Ritter, A., Cherry, C., Dolan, W.B.: Data-driven response generation in social media. In: Proceedings of EMNLP 2011, pp. 583–593 (2011)
  11. Sakai, T.: Bootstrap-based comparisons of IR metrics for finding one relevant document. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 374–389. Springer, Heidelberg (2006)
  12. Sakai, T.: Statistical reform in information retrieval? SIGIR Forum 48(1), 3–12 (2014)
  13. Sakai, T.: Information Access Evaluation Methodology: For the Progress of Search Engines (in Japanese). Coronasha (2015)
  14. Sakai, T.: Topic set size design. Information Retrieval Journal (submitted) (2015)
  15. Sakai, T., Ishikawa, D., Kando, N., Seki, Y., Kuriyama, K., Lin, C.Y.: Using graded-relevance metrics for evaluating community QA answer selection. In: Proceedings of ACM WSDM 2011, pp. 187–196 (2011)
  16. Sakai, T., Robertson, S.: Modelling a user population for designing information retrieval metrics. In: Proceedings of EVIA 2008, pp. 30–41 (2008)
  17. Sanderson, M., Zobel, J.: Information retrieval system evaluation: effort, sensitivity, and reliability. In: Proceedings of ACM SIGIR 2005, pp. 162–169 (2005)
  18. Shibuki, H., Sakamoto, K., Kano, Y., Mitamura, T., Ishioroshi, M., Itakura, K., Wang, D., Mori, T., Kando, N.: Overview of the NTCIR-11 QA-Lab task. In: Proceedings of NTCIR-11, pp. 518–529 (2014)
  19. Voorhees, E.M., Harman, D.K. (eds.): TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge (2005)
  20. Webber, W., Moffat, A., Zobel, J.: Statistical power in retrieval experimentation. In: Proceedings of ACM CIKM 2008, pp. 571–580 (2008)
  21. Weizenbaum, J.: ELIZA - a computer program for the study of natural language communication between man and machine. Commun. ACM 9(1), 36–45 (1966)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Tetsuya Sakai (1, email author)
  • Lifeng Shang (2)
  • Zhengdong Lu (2)
  • Hang Li (2)

  1. Waseda University, Tokyo, Japan
  2. Noah's Ark Lab, Huawei, Hong Kong
