An Extended Random-Sets Model for Fusion-Based Text Feature Selection

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2018)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10939)

Abstract

Selecting features that represent a specific corpus is important for the success of many machine learning and text mining applications. In information retrieval (IR), fusion-based techniques have shown remarkable performance compared to traditional models. However, in text feature selection (FS), popular models do not consider the fusion of the taxonomic features of the corpus. This research proposes an innovative and effective extended random-sets model for fusion-based FS. The model fuses scores of different hierarchical features to accurately weight the representative words based on their appearance across the documents in the corpus and in several latent topics. The model was evaluated for information filtering (IF) using TREC topics and the standard RCV1 dataset. The results showed that the proposed model significantly outperformed eleven state-of-the-art baseline models on six evaluation metrics.
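
The idea of fusing word scores from more than one level can be illustrated with a minimal sketch. This is not the paper's extended random-sets model, whose formulas are not given in the abstract; it is a generic weighted linear fusion of min-max normalised scores, a common baseline in the IR fusion literature. The function names, the `alpha` mixing weight, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score map becomes all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {term: 0.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}


def fuse_term_scores(
    doc_scores: Dict[str, float],
    topic_scores: Dict[str, float],
    alpha: float = 0.5,
) -> Dict[str, float]:
    """Linearly combine document-level and topic-level word scores.

    alpha is a hypothetical mixing weight; a word missing from one
    score map contributes 0.0 from that map.
    """
    doc_n = min_max_normalize(doc_scores)
    topic_n = min_max_normalize(topic_scores)
    terms = set(doc_n) | set(topic_n)
    return {
        t: alpha * doc_n.get(t, 0.0) + (1.0 - alpha) * topic_n.get(t, 0.0)
        for t in terms
    }


def top_k_terms(fused: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-weighted words, breaking ties alphabetically."""
    return sorted(fused, key=lambda t: (-fused[t], t))[:k]


# Toy, made-up scores: document-level statistics and topic-level weights.
doc_scores = {"market": 4.0, "stock": 2.0, "the": 0.0}
topic_scores = {"market": 1.0, "stock": 3.0, "trade": 2.0}
fused = fuse_term_scores(doc_scores, topic_scores, alpha=0.5)
selected = top_k_terms(fused, k=2)
```

In this toy run, "stock" ranks first because it scores well at both levels, while "market" is strong only at the document level. A word that appears in only one score map is penalised rather than dropped, which is a design choice of this sketch and not something the abstract specifies.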

Notes

  1. Words, keywords and terms are used interchangeably in this paper.

  2. SIF stands for Selection of Informative Features, and the ‘2’ refers to the utilisation of both local and global statistics.

  3. http://trec.nist.gov/.

  4. https://www.lemurproject.org/.

Author information

Corresponding author

Correspondence to Abdullah Semran Alharbi.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Alharbi, A.S., Li, Y., Xu, Y. (2018). An Extended Random-Sets Model for Fusion-Based Text Feature Selection. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science, vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_11

  • DOI: https://doi.org/10.1007/978-3-319-93040-4_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-93039-8

  • Online ISBN: 978-3-319-93040-4

  • eBook Packages: Computer Science, Computer Science (R0)
