Machine Learning, Volume 107, Issue 5, pp 825–858

Consensus-based modeling using distributed feature construction with ILP


Abstract

A particularly successful role for Inductive Logic Programming (ILP) is as a tool for discovering useful relational features for subsequent use in a predictive model. Conceptually, the case for using ILP to construct relational features rests on treating these features as functions, the automated discovery of which necessarily requires some form of first-order learning. Practically, there are now several reports in the literature suggesting that augmenting an existing set of features with ILP-discovered relational features can substantially improve the predictive power of a model. While the approach is straightforward enough, much still needs to be done to scale it up so as to explore more fully the space of possible features that an ILP system can construct. This space is, in principle, infinite and, in practice, extremely large; applications have therefore been confined to heuristic or random selections from it. In this paper, we address this computational difficulty by allowing features and models to be constructed in a distributed manner. That is, there is a network of computational units, each of which employs an ILP engine to construct some small number of features and then builds a (local) model. We then employ an asynchronous consensus-based algorithm, in which neighboring nodes share information and update their local models. This gossip-based information exchange results in the formation of non-stationary Markov chains. For a category of models (those with convex loss functions), it can be shown, using the Supermartingale Convergence Theorem, that the algorithm results in all nodes converging to a consensus model. In practice, this convergence may be slow to achieve. Nevertheless, our results on synthetic and real datasets suggest that, in a relatively short time, the “best” node in the network reaches a model whose predictive accuracy is comparable to that obtained using more computational effort in a non-distributed setting (the best node is identified as the one whose weights converge first).
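To make the algorithmic idea concrete, the following is a minimal Python sketch (not the authors' implementation) of gossip-based consensus learning over locally constructed features: each node runs stochastic gradient descent on a convex loss (here, logistic loss) over its own feature matrix, which stands in for the ILP-constructed relational features, and then averages its weight vector with a randomly chosen neighbor. The synchronous rounds, the choice of loss, the ring topology, and all identifiers (`sgd_step`, `gossip_round`) are illustrative assumptions; the algorithm described in the paper is asynchronous.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, X, y, lr=0.1):
    """One stochastic gradient step on the logistic loss -- a convex
    loss, matching the class of models covered by the convergence result."""
    i = rng.integers(len(y))
    xi, yi = X[i], y[i]                      # labels assumed in {-1, +1}
    grad = -yi * xi / (1.0 + np.exp(yi * (w @ xi)))
    return w - lr * grad

def gossip_round(weights, neighbours, X, y):
    """One (synchronous, for simplicity) round: every node takes a local
    SGD step on its own feature matrix, then averages its weight vector
    with one randomly chosen neighbour -- the gossip exchange."""
    n = len(weights)
    for u in range(n):
        weights[u] = sgd_step(weights[u], X[u], y[u])
    for u in range(n):
        v = rng.choice(neighbours[u])
        avg = 0.5 * (weights[u] + weights[v])
        weights[u], weights[v] = avg, avg.copy()
    return weights

# Toy usage: 4 nodes on a ring, each with its own 50-example, 5-feature dataset.
d, nodes = 5, 4
X = [rng.normal(size=(50, d)) for _ in range(nodes)]
y = [np.sign(Xu @ np.ones(d)) for Xu in X]
neighbours = [[(u - 1) % nodes, (u + 1) % nodes] for u in range(nodes)]
weights = [np.zeros(d) for _ in range(nodes)]
for _ in range(200):
    weights = gossip_round(weights, neighbours, X, y)
```

Repeated pairwise averaging of this kind is what gives rise to the non-stationary Markov chains mentioned above; under convexity of the loss, the weight vectors at all nodes can be shown to converge to a common consensus model.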

Keywords

Inductive logic programming · Consensus-based learning · Stochastic gradient descent · Feature selection

Acknowledgements

H.D. is also an adjunct assistant professor in the Department of Computer Science at IIIT, Delhi, and an Affiliated Member of the Institute of Data Sciences, Columbia University, NY. A.S. also holds visiting positions at the School of CSE, University of New South Wales, Sydney, and at the Department of Computer Science, Oxford University, Oxford.


Copyright information

© The Author(s) 2017

Authors and Affiliations

  1. Department of Management Science and Systems, University at Buffalo, New York, USA
  2. Department of Computer Sciences and Information Systems, BITS-Pilani, Goa, India
