Statistical and Comparative Evaluation of Various Indexing and Search Models

  • Samir Abdou
  • Jacques Savoy
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4182)


This paper first describes various strategies (character, bigram, automatic segmentation) used to index the Chinese (ZH), Japanese (JA) and Korean (KR) languages. Second, based on the NTCIR-5 test collections, it evaluates various retrieval models, ranging from classical vector-space models to more recent developments in probabilistic and language models. While no clear conclusion was reached for Japanese, the bigram-based indexing strategy seems to be the best choice for Korean, and the combined "unigram & bigram" indexing strategy is best for traditional Chinese. Among the retrieval models, the Divergence from Randomness (DFR) probabilistic model usually results in the best mean average precision. Finally, an evaluation of four different statistical tests shows that their conclusions correlate, and even more so when comparing the non-parametric bootstrap with the t-test.
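The character-based indexing strategies named in the abstract can be illustrated with a minimal sketch. It assumes that a "unigram" is a single character, a "bigram" is an overlapping pair of adjacent characters, and the combined strategy simply emits both sets of terms. The function names `char_ngrams` and `index_terms` are hypothetical, and the treatment of whitespace and of texts shorter than the n-gram size is an assumption made for illustration; the abstract does not specify how the authors handle punctuation, stopwords, or Latin-script words.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return overlapping character n-grams of `text`.

    Assumption: whitespace is dropped before n-grams are built, and a
    text shorter than n yields itself as a single term.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    if len(chars) < n:
        return ["".join(chars)]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def index_terms(text: str, strategy: str = "unigram+bigram") -> list[str]:
    """Produce indexing terms for one of three illustrative strategies."""
    if strategy == "unigram":
        return char_ngrams(text, 1)
    if strategy == "bigram":
        return char_ngrams(text, 2)
    if strategy == "unigram+bigram":
        # Combined strategy: index single characters and character pairs.
        return char_ngrams(text, 1) + char_ngrams(text, 2)
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    sample = "資訊檢索"  # "information retrieval" in traditional Chinese
    for name in ("unigram", "bigram", "unigram+bigram"):
        print(name, index_terms(sample, name))
```

Under these assumptions a four-character string gives four unigrams, three bigrams, and seven terms for the combined strategy. Automatic segmentation, the third strategy in the abstract, needs a dictionary or a trained segmenter and is not sketched here.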





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Samir Abdou (1)
  • Jacques Savoy (1)
  1. Computer Science Department, University of Neuchatel, Neuchatel, Switzerland
