Multiple-cause discovery combined with structure learning for high-dimensional discrete data and application to stock prediction
Causal discovery from observational data is crucial to a variety of scientific and business research. Although many causal discovery algorithms have been proposed in recent decades, none of them is effective enough in dealing with high-dimensional discrete data. The main challenge is the complex interactions among a large number of variables, which lead to numerous spurious causal relations being found. In this work, we propose a novel multiple-cause discovery method combined with structure learning (McDSL) to eliminate these spurious causal relations. The method is carried out in two phases. In the first phase, a conditional independence test is used to distinguish direct causal candidates from indirect ones. In the second phase, the causal direction of the multi-cause structure is determined with a hybrid causal discovery method. Validation experiments on synthetic data showed that McDSL is reliable in discovering multi-cause structures and eliminating indirect causes. We then applied the algorithm to discover multiple causes of stock return from 13 years of historical financial data of the Shanghai Stock Exchange of China, and established a stock prediction model. Experimental results showed that the causes discovered by McDSL revealed changes in the key risk factors of the stock market over the 13 years, which indicates that investors should change their investment strategies over time. Moreover, the causes discovered by McDSL predict stock return better than the features selected by other common filter-based feature selection algorithms.
Keywords: Causal discovery; High-dimensional discrete data; Structure learning; Additive noise model; Stock prediction
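The first phase described in the abstract, separating direct causal candidates from indirect ones with a conditional independence test, can be illustrated with a minimal sketch. The abstract does not specify which test McDSL uses, so the sketch below swaps in an empirical conditional mutual information score with a fixed cutoff; the function names `conditional_mutual_information` and `direct_candidates`, the `threshold` value, and the toy chain A → B → Y are all assumptions made for illustration, not the authors' implementation. The second phase (determining causal direction) is not shown.

```python
from collections import Counter
from math import log


def conditional_mutual_information(x, y, z):
    """Empirical I(X; Y | Z) in nats for equal-length discrete sequences."""
    n = len(x)
    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)
    cmi = 0.0
    for (xi, yi, zi), c in n_xyz.items():
        cmi += (c / n) * log(c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)]))
    return cmi


def direct_candidates(data, target, threshold=0.01):
    """Hypothetical helper: keep variables that stay dependent on `target`
    after conditioning on each other single candidate.

    `threshold` is an assumed cutoff on the dependence score, standing in
    for a proper significance test.
    """
    names = [v for v in data if v != target]
    const = [0] * len(data[target])  # conditioning on a constant gives plain MI
    dependent = [
        v for v in names
        if conditional_mutual_information(data[v], data[target], const) > threshold
    ]
    kept = []
    for v in dependent:
        others = [w for w in dependent if w != v]
        if all(
            conditional_mutual_information(data[v], data[target], data[w]) > threshold
            for w in others
        ):
            kept.append(v)
    return kept


# Toy data for the chain A -> B -> Y, plus an unrelated variable N.
# B copies A in 3 of 4 rows and Y copies B in 3 of 4 rows, so A is only
# an indirect cause of Y.
data = {"A": [], "B": [], "N": [], "Y": []}
for a in (0, 1):
    for b in (a, a, a, 1 - a):
        for y in (b, b, b, 1 - b):
            for noise in (0, 1):
                data["A"].append(a)
                data["B"].append(b)
                data["N"].append(noise)
                data["Y"].append(y)

print(direct_candidates(data, "Y"))
```

In this toy data, A is marginally dependent on Y but becomes independent of Y once B is conditioned on, so only B survives as a direct candidate. A practical implementation would condition on larger subsets and replace the fixed cutoff with a statistical test suited to the sample size.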
This research was partly supported by the National Natural Science Foundation of China (71271061, 70801020), Science and Technology Planning Project of Guangdong Province, China (2010B010600034, 2012B091100192), Guangdong Natural Science Foundation Research Team (S2013030015737), and Business Intelligence Key Team of Guangdong University of Foreign Studies (TD1202).