A Two-Step Dimensionality Reduction Scheme for Dark Web Text Classification
The dark web is infamous for hosting unethical and illegal content, and intelligence agencies increasingly rely on automated approaches to detect it. Machine learning classification techniques can detect such content in textual data from dark web sites, but their performance suffers when the dataset contains irrelevant features. This paper proposes a two-step dimensionality reduction scheme, based on mutual information and linear discriminant analysis, for classifying dark web textual content. In the first step, mutual information filters out the irrelevant features; in the second, linear discriminant analysis transforms the remaining features into a lower-dimensional space. The proposed scheme is evaluated on a dark web dataset collected explicitly from dark web sites using a web crawler, and on the Reuters-21578 dataset for benchmarking, with three different classifiers. The results obtained on the two datasets indicate that the proposed two-step technique improves classification performance while significantly reducing the number of features.
Keywords: Dark web · Text classification · Feature selection · Dimensionality reduction
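The two-step scheme described in the abstract can be sketched as a standard scikit-learn pipeline: a mutual-information filter keeps the top-scoring features, and linear discriminant analysis then projects them into at most (number of classes − 1) dimensions before classification. This is a minimal illustrative sketch, not the paper's exact implementation; the synthetic data, the feature count `k=100`, and the Gaussian naive Bayes classifier are all assumptions standing in for the paper's crawled dark web corpus and its three classifiers.

```python
# Sketch of the two-step dimensionality reduction: mutual-information
# filtering (step 1) followed by an LDA projection (step 2).
# All dataset sizes and the final classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a document-term matrix (e.g. TF-IDF features).
X, y = make_classification(n_samples=300, n_features=500, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    # Step 1: keep the 100 features with the highest mutual information
    # with the class labels, discarding irrelevant ones.
    ("mi_filter", SelectKBest(mutual_info_classif, k=100)),
    # Step 2: transform the surviving features into a new space of at
    # most (n_classes - 1) = 3 discriminant components.
    ("lda", LinearDiscriminantAnalysis(n_components=3)),
    # Any classifier can sit on top of the reduced representation.
    ("clf", GaussianNB()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The feature count drops from 500 to 3 before classification, which mirrors the paper's claim of a significant reduction in dimensionality; on real text data the MI threshold `k` would be tuned per corpus.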