Abstract
Selecting features that represent a specific corpus is important for the success of many machine learning and text mining applications. In information retrieval (IR), fusion-based techniques have shown remarkable performance compared to traditional models. However, in text feature selection (FS), popular models do not consider the fusion of the taxonomic features of the corpus. This research proposed an innovative and effective extended random-sets model for fusion-based FS. The model fused scores of different hierarchal features to accurately weight the representative words based on their appearance across the documents in the corpus and in several latent topics. The model was evaluated for information filtering (IF) using TREC topics and the standard RCV1 dataset. The results showed that the proposed model significantly outperformed eleven state-of-the-art baseline models in six evaluation metrics.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Words, keywords and terms are used interchangeably in this paper.
- 2.
SIF stands for Selection of Informative Features, and the ā2ā refers to the utilisation of both local and global statistics.
- 3.
- 4.
References
Albathan, M., Li, Y., Xu, Y.: Using extended random set to find specific patterns. In: WI 2014, vol. 2, pp. 30ā37. IEEE (2014)
Algarni, A., Li, Y.: Mining specific features for acquiring user information needs. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013 Part I. LNCS (LNAI), vol. 7818, pp. 532ā543. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37453-1_44
Alharbi, A.S., Li, Y., Xu, Y.: Integrating LDA with clustering technique for relevance feature selection. In: Peng, W., Alahakoon, D., Li, X. (eds.) AI 2017. LNCS (LNAI), vol. 10400, pp. 274ā286. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63004-5_22
Anava, Y., Shtok, A., Kurland, O., Rabinovich, E.: A probabilistic fusion framework. In: CIKM 2016, pp. 1463ā1472. ACM (2016)
Bashar, M.A., Li, Y.: Random set to interpret topic models in terms of ontology concepts. In: Peng, W., Alahakoon, D., Li, X. (eds.) AI 2017. LNCS (LNAI), vol. 10400, pp. 237ā249. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63004-5_19
Bashar, M.A., Li, Y., Gao, Y.: A framework for automatic personalised ontology learning. In: WI 2016, pp. 105ā112. IEEE (2016)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993ā1022 (2003)
Buckley, C., Voorhees, E.M.: Evaluating evaluation measure stability. In: SIGIR 2000, pp. 33ā40. ACM (2000)
Croft, W.B.: Combining approaches to information retrieval. In: Croft, W.B. (ed.) Advances in Information Retrieval. INRE, vol. 7, pp. 1ā36. Springer, Boston (2002). https://doi.org/10.1007/0-306-47019-5_1
Gao, Y., Xu, Y., Li, Y.: Pattern-based topic models for information filtering. In: ICDM 2013, pp. 921ā928. IEEE (2013)
Gao, Y., Xu, Y., Li, Y.: Topical pattern based document modelling and relevance ranking. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds.) WISE 2014 Part I. LNCS, vol. 8786, pp. 186ā201. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11749-2_15
Gao, Y., Xu, Y., Li, Y.: Pattern-based topics for document modelling in information filtering. IEEE TKDE 27(6), 1629ā1642 (2015)
Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 42(1ā2), 177ā196 (2001)
Joachims, T.: Optimizing search engines using clickthrough data. In: KDD 2002, pp. 133ā142. ACM (2002)
Lan, M., Tan, C.L., Su, J., Lu, Y.: Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI 31(4), 721ā735 (2009)
Li, Y.: Extended random sets for knowledge discovery in information systems. In: Wang, G., Liu, Q., Yao, Y., Skowron, A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 524ā532. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-39205-X_87
Li, Y., Algarni, A., Albathan, M., Shen, Y., Bijaksana, M.A.: Relevance feature discovery for text mining. IEEE TKDE 27(6), 1656ā1669 (2015)
Li, Y., Algarni, A., Zhong, N.: Mining positive and negative patterns for relevance feature discovery. In: KDD 2010, pp. 753ā762. ACM (2010)
Li, Y., Li, T., Liu, H.: Recent advances in feature selection and its applications. Knowl. Inf. Syst. 53, 1ā27 (2017)
Macdonald, C., Ounis, I.: Global statistics in proximity weighting models. In: Web N-gram Workshop. p. 30. Citeseer (2010)
Manning, C.D., Raghavan, P., SchĆ¼tze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
Maxwell, K.T., Croft, W.B.: Compact query term selection using topically related text. In: SIGIR 2013, pp. 583ā592. ACM (2013)
McCallum, A.K.: Mallet: a machine learning for language toolkit (2002)
Molchanov, I.: Theory of Random Sets. Springer, Heidelberg (2006). https://doi.org/10.1007/1-84628-150-4
Nguyen, H.T.: Random sets. Scholarpedia 3(7), 3383 (2008)
Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Now Publishers Inc., Breda (2009)
Robertson, S.E., Soboroff, I.: The TREC 2002 filtering track report. In: TREC, vol. 2002, p. 5 (2002)
Steyvers, M., Griffiths, T.: Probabilistic topic models. Handb. Latent Semant. Anal. 427(7), 424ā440 (2007)
Wang, X., McCallum, A., Wei, X.: Topical n-grams: phrase and topic discovery, with an application to information retrieval. In: ICDM 2007, pp. 697ā702. IEEE (2007)
Wu, S.: Data Fusion in Information Retrieval. Springer, Heidelberg (2012)
Zhang, S., Balog, K.: Design patterns for fusion-based object retrieval. In: Jose, J.M., Hauff, C., Altıngovde, I.S., Song, D., Albakour, D., Watt, S., Tait, J. (eds.) ECIR 2017. LNCS, vol. 10193, pp. 684ā690. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56608-5_66
Zhong, N., Li, Y., Wu, S.T.: Effective pattern discovery for text mining. IEEE TKDE 24(1), 30ā44 (2012)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
Ā© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Alharbi, A.S., Li, Y., Xu, Y. (2018). An Extended Random-Sets Model forĀ Fusion-Based Text Feature Selection. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10939. Springer, Cham. https://doi.org/10.1007/978-3-319-93040-4_11
Download citation
DOI: https://doi.org/10.1007/978-3-319-93040-4_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93039-8
Online ISBN: 978-3-319-93040-4
eBook Packages: Computer ScienceComputer Science (R0)