Using Sub-sequence Patterns for Detecting Organ Trafficking Websites

  • Suraj Jung Pandey
  • Suresh Manandhar
  • Agnieszka Kleszcz
Part of the Communications in Computer and Information Science book series (CCIS, volume 368)

Abstract

This paper presents a novel method for mining suspicious websites from the World Wide Web using state-of-the-art pattern mining and machine learning methods. In this document, the term “suspicious website” means any website that contains known or suspected violations of law. Although we present our evaluation on illegal online organ trading, the method described in this paper is generic and can be used to detect any specific kind of website. We use an iterative setting in which each iteration unearths both normal and suspicious websites. These newly detected websites are added to our training examples and used in subsequent iterations. The first iteration uses user-supplied seed normal and suspicious websites. We show that accuracy increases in the initial iterations but decreases as the number of iterations grows further. This is due to the bias caused by adding a large number of normal websites and to the noise that automatic labelling introduces into the training examples.
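
The iterative setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the count-based scorer below is a simple stand-in for the pattern mining and machine learning methods the paper refers to, and every name in it (`subsequences`, `train`, `score`, `bootstrap`, the `window` and `threshold` parameters, and the example page texts) is a hypothetical choice made for illustration only.

```python
from collections import Counter


def subsequences(text, window=3):
    """Ordered word pairs that co-occur within a small window (gapped word sub-sequences)."""
    words = text.lower().split()
    feats = set()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            feats.add((w, words[j]))
    return feats


def train(examples):
    """Toy model (assumption): each pattern's weight is the sum of the labels
    (+1 suspicious, -1 normal) of the pages that contain it."""
    weights = Counter()
    for text, label in examples:
        for f in subsequences(text):
            weights[f] += label
    return weights


def score(weights, text):
    """Positive scores lean suspicious, negative scores lean normal."""
    return sum(weights[f] for f in subsequences(text))


def bootstrap(seeds, unlabeled, iterations=3, threshold=2):
    """Grow the training set from user-supplied seeds over several iterations."""
    train_set = list(seeds)
    pool = list(unlabeled)
    for _ in range(iterations):
        weights = train(train_set)
        remaining = []
        for text in pool:
            s = score(weights, text)
            if s >= threshold:
                train_set.append((text, 1))    # newly detected suspicious page
            elif s <= -threshold:
                train_set.append((text, -1))   # newly detected normal page
            else:
                remaining.append(text)         # not confident enough yet
        if len(remaining) == len(pool):
            break                              # nothing new was labelled
        pool = remaining
    return train(train_set), train_set, pool


# Hypothetical seed pages and unlabeled pages.
seeds = [("buy kidney for cash now", 1),
         ("hospital donor registry information", -1)]
unlabeled = ["sell kidney for cash urgently",
             "official donor registry information page",
             "weather forecast for tomorrow"]
weights, labeled, pool = bootstrap(seeds, unlabeled)
```

In this sketch the `threshold` parameter controls how readily automatically labelled pages enter the training set. A low threshold admits more pages per iteration but also more mislabelled ones, which is the kind of noise the abstract identifies as a cause of declining accuracy in later iterations.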

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Suraj Jung Pandey (1)
  • Suresh Manandhar (1)
  • Agnieszka Kleszcz (2)
  1. University of York, Heslington, UK
  2. AGH University of Science and Technology, Krakow, Poland
