Enhanced N-Gram Extraction Using Relevance Feature Discovery

  • Mubarak Albathan
  • Yuefeng Li
  • Abdulmohsen Algarni
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8272)

Abstract

Guaranteeing the quality of extracted features that describe relevant knowledge to users or topics is a challenge because of the large number of extracted features. Most popular term-based feature selection methods extract noisy features, that is, features irrelevant to the user's needs. One popular alternative is to extract phrases or n-grams to describe the relevant knowledge. However, extracted n-grams and phrases usually contain a lot of noise as well. This paper proposes a method for reducing the noise in n-grams. The method first extracts more specific features (terms) in order to remove noisy features. It then uses an extended random set to weight n-grams accurately, based on their distribution in the documents and on the distribution of their terms within the n-grams. The proposed approach not only reduces the number of extracted n-grams but also improves performance. Experimental results on the Reuters Corpus Volume 1 (RCV1) data collection and TREC topics show that the proposed method significantly outperforms state-of-the-art methods underpinned by Okapi BM25, tf*idf and Rocchio.
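The two-stage idea in the abstract (filter terms first, then weight n-grams by their distribution in the documents and by the distribution of their terms) can be illustrated with a minimal sketch. This is not the paper's extended random set weighting, which the abstract does not specify. The function names, the `min_term_df` threshold, and the scoring formula below are all assumptions chosen only to make the pipeline concrete.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weight_ngrams(docs, n=2, min_term_df=2):
    """Hypothetical stand-in for the two stages described in the abstract.

    Stage 1: keep only terms that occur in at least `min_term_df` documents
             (an assumed, simplistic notion of a "specific" term).
    Stage 2: weight each surviving n-gram by its document frequency times
             the average document frequency of its terms (an assumed formula,
             not the extended random set weighting from the paper).
    """
    tokenized = [doc.lower().split() for doc in docs]
    num_docs = len(docs)

    # Document frequency of every term.
    term_df = Counter()
    for tokens in tokenized:
        term_df.update(set(tokens))
    kept_terms = {t for t, df in term_df.items() if df >= min_term_df}

    # Document frequency of n-grams built only from kept terms.
    ngram_df = Counter()
    for tokens in tokenized:
        grams = {g for g in extract_ngrams(tokens, n)
                 if all(t in kept_terms for t in g)}
        ngram_df.update(grams)

    weights = {}
    for gram, df in ngram_df.items():
        term_score = sum(term_df[t] for t in gram) / (n * num_docs)
        weights[gram] = (df / num_docs) * term_score
    return weights
```

Filtering terms before forming n-grams is what shrinks the candidate set in this sketch: any n-gram containing a discarded term is never counted.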

Keywords

Feature selection, relevance feedback, terms weight, n-gram extraction

Copyright information

© Springer International Publishing Switzerland 2013

Authors and Affiliations

  • Mubarak Albathan (1, 2)
  • Yuefeng Li (1)
  • Abdulmohsen Algarni (3)
  1. School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
  2. Al Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia
  3. College of Computer Science, King Khaled University, Abha, Saudi Arabia
