A Two-Step Dimensionality Reduction Scheme for Dark Web Text Classification

  • Mohd Faizan
  • Raees Ahmad Khan
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1097)


The dark web is infamous for hosting unethical and illegal content, and intelligence agencies increasingly rely on automated approaches to detect it. Machine learning classification techniques can identify such content in textual data from dark web sites, but their performance suffers when the dataset contains irrelevant features. This paper proposes a two-step dimensionality reduction scheme, based on mutual information and linear discriminant analysis, for classifying dark web textual content. In the first step, irrelevant features are filtered out using mutual information; in the second, the remaining features are transformed into a lower-dimensional space using linear discriminant analysis. The proposed scheme is evaluated on a dark web dataset collected explicitly from dark web sites with a web crawler and, for benchmarking purposes, on the Reuters-21578 dataset. Three different classifiers were used for classification. The results on both datasets indicate that the proposed two-step technique improves classification performance while significantly reducing the number of features.


Keywords: Dark web · Text classification · Feature selection · Dimensionality reduction
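The two-step scheme described in the abstract can be sketched with scikit-learn, which the paper uses for its experiments. This is a minimal illustrative sketch, not the authors' exact configuration: the toy corpus, the choice of k=10 retained features, and the logistic regression classifier are all assumptions for demonstration.

```python
# Sketch of the two-step reduction: a mutual-information filter (step 1)
# followed by a linear discriminant analysis projection (step 2).
# Corpus, k, and classifier are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

docs = [
    "buy cheap pills online fast shipping",
    "discreet shipping buy pills now",
    "online market sells pills cheap",
    "fast discreet market buy pills",
    "privacy tools and secure chat forum",
    "forum discussion about secure email",
    "chat about privacy and email tools",
    "secure forum for privacy discussion",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Vectorize the corpus into TF-IDF features.
X = TfidfVectorizer().fit_transform(docs)

# Step 1: keep the k features with the highest mutual information
# with the class labels, filtering out irrelevant terms.
selector = SelectKBest(mutual_info_classif, k=10)
X_filtered = selector.fit_transform(X, labels)

# Step 2: project the surviving features into at most (n_classes - 1)
# dimensions with LDA (here 1 dimension, since there are two classes).
lda = LinearDiscriminantAnalysis(n_components=1)
X_reduced = lda.fit_transform(X_filtered.toarray(), labels)

# Any classifier can then be trained on the reduced representation.
clf = LogisticRegression().fit(X_reduced, labels)
print(X.shape[1], "->", X_filtered.shape[1], "->", X_reduced.shape[1])
```

Note that LDA caps the output dimensionality at one less than the number of classes, which is why the mutual-information filter is applied first: it removes noisy terms cheaply before the more aggressive linear projection.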



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Mohd Faizan (1)
  • Raees Ahmad Khan (1)

  1. Department of Information Technology, Babasaheb Bhimrao Ambedkar University, Lucknow, India
