Reference Data Sets for Spam Detection: Creation, Analysis, Propagation
A reference set is a collection of network traffic data whose form and content allow an event or a group of events to be detected. Realistic and representative datasets based on real traffic can improve research in the fields of intrusion and anomaly detection. Creating reference sets involves tackling a number of issues, such as the collection and storage of large volumes of data, the privacy of information and the relevance of the collected events. Moreover, rare events are hard to analyse among background traffic and require specialised detection tools. One of the common problems that can be detected in network traffic is spam. This paper presents a methodology for creating a network traffic reference set for spam detection. The methodology covers the selection of significant features, the collection and storage of data, the analysis of the collected data, the enrichment of the data with additional events and the propagation of the set. Moreover, a hybrid classifier that detects spam at a relatively high level is presented.
Keywords: Reference sets, Spam detection, Flow analysis, Anomaly detection, Hybrid classifiers