Studying Online and Offline Evaluation Measures: A Case Study Based on the NTCIR-14 OpenLiveQ-2 Task


Part of the Lecture Notes in Computer Science book series (LNISA, volume 11966)

Abstract

We describe our participation in the NTCIR-14 OpenLiveQ-2 task and our post-submission investigations. For a given query and a set of questions with their answers, participants in the OpenLiveQ task were required to return a ranked list of questions that potentially match and effectively satisfy the user’s query. In this paper we focus on two main investigations: (i) finding effective features that go beyond relevance alone for the task of ranking questions for a given query in the Japanese language; (ii) analyzing the nature and relationship of online and offline evaluation measures. We use the OpenLiveQ-2 dataset for our study. Our first investigation examines user log-based features (e.g. number of views, whether the question is solved) and content-based features (e.g. BM25 and language model (LM) scores). Overall, we find that log-based features reflecting a question’s popularity, freshness, etc. dominate question ranking, rather than content-based features measuring query–question similarity. Our second investigation finds that the offline measures correlate highly among themselves, but that the correlation between offline and online measures is quite low. We find that this low correlation between online and offline measures is also reflected in discrepancies between the systems’ rankings for the OpenLiveQ-2 task, although this depends on the nature and type of the evaluation measures.
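The second investigation above compares evaluation measures by correlating per-system scores. A minimal sketch of that kind of analysis is shown below, using a plain Pearson correlation over hypothetical per-system scores for two offline measures (the nDCG@10 and ERR@10 values here are made up for illustration and are not the paper’s data; the paper itself used `scipy.stats.pearsonr`):

```python
# Illustrative sketch: correlating two offline evaluation measures
# across a set of submitted systems. Scores are hypothetical.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system scores under two offline measures.
ndcg_at_10 = [0.42, 0.55, 0.61, 0.38, 0.70]
err_at_10 = [0.30, 0.41, 0.47, 0.28, 0.52]

# Offline measures computed from the same relevance judgments tend to
# correlate highly with each other, as the paper reports.
print(round(pearson(ndcg_at_10, err_at_10), 3))
```

The same procedure, applied between an offline measure and an online measure such as a click-based credit from multileaved comparison, is what surfaces the low cross-correlation the paper reports.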

Keywords

  • Learning To Rank models
  • Question-answer ranking
  • Online and offline testing
  • Correlation of online and offline measures

Notes

  1. https://chiebukuro.yahoo.co.jp/.

  2. https://scikit-learn.org/stable/.

  3. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html.

  4. A similar pattern of results was observed using Spearman’s and Kendall’s Tau correlation metrics during our investigation; these results are omitted due to space constraints.


Acknowledgement

This research is supported by Science Foundation Ireland (SFI) as part of the ADAPT Centre at Dublin City University (Grant No. 12/CE/I2267).

Author information

Correspondence to Piyush Arora.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Arora, P., Jones, G.J.F. (2019). Studying Online and Offline Evaluation Measures: A Case Study Based on the NTCIR-14 OpenLiveQ-2 Task. In: Kato, M., Liu, Y., Kando, N., Clarke, C. (eds) NII Testbeds and Community for Information Access Research. NTCIR 2019. Lecture Notes in Computer Science, vol. 11966. Springer, Cham. https://doi.org/10.1007/978-3-030-36805-0_6

  • DOI: https://doi.org/10.1007/978-3-030-36805-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36804-3

  • Online ISBN: 978-3-030-36805-0

  • eBook Packages: Computer Science, Computer Science (R0)