An Extended Random-Sets Model for Fusion-Based Text Feature Selection

  • Abdullah Semran Alharbi
  • Yuefeng Li
  • Yue Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10939)

Abstract

Selecting features that represent a specific corpus is important for the success of many machine learning and text mining applications. In information retrieval (IR), fusion-based techniques have shown remarkable performance compared to traditional models. However, in text feature selection (FS), popular models do not consider the fusion of the taxonomic features of the corpus. This research proposes an innovative and effective extended random-sets model for fusion-based FS. The model fuses the scores of different hierarchical features to accurately weight representative words based on their appearance across the documents in the corpus and in several latent topics. The model was evaluated for information filtering (IF) using TREC topics and the standard RCV1 dataset. The results show that the proposed model significantly outperforms eleven state-of-the-art baseline models on six evaluation metrics.
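
As a rough illustration of what fusing term scores from two levels of a corpus can look like, the sketch below combines a document-level score (the fraction of documents containing a term) with a topic-level score (topic-weighted term probabilities) through min-max normalisation and a weighted sum. This is a generic linear score-fusion example, not the extended random-sets model proposed in the paper. The function names, the `alpha` parameter, the toy documents, and the hand-written topic distributions are all assumptions made for illustration; in practice the topic distributions would come from a topic model such as LDA.

```python
from collections import Counter


def doc_frequency_scores(docs):
    """Document-level score: fraction of documents in which each term appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {term: count / len(docs) for term, count in df.items()}


def topic_level_scores(topic_word_probs, topic_weights):
    """Topic-level score: P(term | topic) summed over topics, weighted per topic."""
    scores = Counter()
    for probs, weight in zip(topic_word_probs, topic_weights):
        for term, p in probs.items():
            scores[term] += weight * p
    return dict(scores)


def min_max(scores):
    """Rescale scores to [0, 1] so the two signals are comparable before fusion."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {term: 0.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}


def fuse(doc_scores, topic_scores, alpha=0.5):
    """Weighted linear fusion; alpha (hypothetical) balances the two signals."""
    d, t = min_max(doc_scores), min_max(topic_scores)
    return {
        term: alpha * d.get(term, 0.0) + (1 - alpha) * t.get(term, 0.0)
        for term in set(d) | set(t)
    }


# Toy corpus and hand-written topic distributions (illustrative only).
docs = [["fusion", "model", "text"], ["fusion", "topic"], ["text", "topic", "fusion"]]
topics = [{"fusion": 0.6, "text": 0.4}, {"topic": 0.7, "model": 0.3}]
topic_weights = [0.5, 0.5]

fused = fuse(doc_frequency_scores(docs), topic_level_scores(topics, topic_weights))
ranked = sorted(fused, key=fused.get, reverse=True)
print(ranked)
```

Normalising before the weighted sum keeps either signal from dominating simply because of its scale; other fusion rules (for example rank-based ones) could be substituted in `fuse` without changing the rest of the sketch.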

Keywords

Feature selection, Data fusion, Topic modelling, Term weighting, Extended Random Set

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. School of EECS, Queensland University of Technology, Brisbane, Australia
  2. Department of CS, Umm Al-Qura University, Mecca, Saudi Arabia