On the Effect of Stopword Removal for SMS-Based FAQ Retrieval

  • Johannes Leveling
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7337)

Abstract

This paper investigates the effects of stopword removal in different stages of a system for SMS-based FAQ retrieval. Experiments are performed on the FIRE 2011 monolingual English data. The FAQ system comprises several stages, including normalization and correction of SMS, retrieval of FAQs potentially containing answers using the BM25 retrieval model, and detection of out-of-domain (OOD) queries based on a k-nearest-neighbor classifier. Both retrieval and OOD detection are tested with different stopword lists. Results indicate that i) retrieval performance is highest when stopwords are not removed and decreases when longer stopword lists are employed, ii) OOD detection accuracy decreases when the classifier is trained on features collected during retrieval using no stopwords, and iii) a combination of retrieval using no stopwords and OOD detection trained using the SMART stopwords yields the best results: 75.1% of in-domain queries are answered correctly and 85.6% of OOD queries are detected correctly.
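
To make the retrieval-stage comparison concrete, the following is a minimal sketch of BM25 scoring with an optional stopword list. It is not the system described in the paper: the stopword set, the example FAQ strings, the whitespace tokenizer, and the parameter values k1 = 1.2 and b = 0.75 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy stopword list; real lists such as SMART are far longer.
TOY_STOPWORDS = frozenset({"how", "do", "i", "a", "the", "is", "my", "to", "what"})


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, and drop any token found in `stopwords`."""
    return [t for t in text.lower().split() if t not in stopwords]


def bm25_scores(query, docs, stopwords=frozenset(), k1=1.2, b=0.75):
    """Return one BM25 score per document (assumes `docs` is non-empty)."""
    doc_tokens = [tokenize(d, stopwords) for d in docs]
    query_tokens = tokenize(query, stopwords)
    n_docs = len(doc_tokens)
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = (sum(len(d) for d in doc_tokens) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in doc_tokens for t in set(d))

    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            # Non-negative IDF variant, an assumption made for this sketch.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Illustrative data only; not taken from the FIRE 2011 collection.
faqs = [
    "how do i reset my password",
    "what is the train fare to delhi",
    "how to open a bank account",
]
query = "how do i reset password"

print(bm25_scores(query, faqs))                 # stopwords kept
print(bm25_scores(query, faqs, TOY_STOPWORDS))  # stopwords removed
```

With stopwords kept, the third FAQ still receives a small score through the shared word "how"; with the toy list applied, that match disappears. Short SMS queries leave few terms to match on, which is one plausible reason why removing stopwords could hurt retrieval in this setting.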

Keywords

Short Message Service; Retrieval Performance; Stopword Removal; Short Message Service Message; Information Retrieval Evaluation

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Johannes Leveling (1)
  1. Centre for Next Generation Localisation (CNGL), Dublin City University, Dublin 9, Ireland
