Mining Extremely Skewed Trading Anomalies
Trading surveillance systems screen for and detect anomalous trades in equities, bonds, and mortgage certificates, among other instruments. Their purpose is to satisfy federal trading regulations and to prevent crimes such as insider trading and money laundering. Most existing trading surveillance systems are based on hand-coded expert rules. Such systems are known to require a long development process and to produce extremely high “false positive” rates. We are participating in the co-development of a data mining based automatic trading surveillance system for one of the biggest banks in the US. The challenge of this task is to handle a very skewed positive class (< 0.01%) together with a very large volume of data (millions of records and hundreds of features). The combination of a very skewed distribution and a huge data volume poses a new challenge for data mining: previous work addresses these issues separately, and existing solutions are complicated and not straightforward to implement. In this paper, we propose a simple, systematic approach to mining “very skewed distribution in very large volume of data”.
Keywords: False Positive Rate, Loss Function, Probability Estimate, Main Memory, True Positive Rate