# Consensus-based modeling using distributed feature construction with ILP


## Abstract

A particularly successful role for Inductive Logic Programming (ILP) is as a tool for discovering useful relational features for subsequent use in a predictive model. Conceptually, the case for using ILP to construct relational features rests on treating these features as functions, the automated discovery of which necessarily requires some form of first-order learning. Practically, there are now several reports in the literature suggesting that augmenting any existing features with ILP-discovered relational features can substantially improve the predictive power of a model. While the approach is straightforward enough, much still needs to be done to scale it up to explore more fully the space of possible features that can be constructed by an ILP system. This space is, in principle, infinite and, in practice, extremely large. Applications have been confined to heuristic or random selections from this space. In this paper, we address this computational difficulty by allowing features and models to be constructed in a distributed manner. That is, there is a network of computational units, each of which employs an ILP engine to construct some small number of features and then builds a (local) model. We then employ an asynchronous consensus-based algorithm, in which neighboring nodes share information and update local models. This gossip-based information exchange results in the formation of non-stationary Markov chains. For a category of models (those with convex loss functions), it can be shown (using the Supermartingale Convergence Theorem) that the algorithm will result in all nodes converging to a consensus model. In practice, this convergence may be slow. Nevertheless, our results on synthetic and real datasets suggest that in a relatively short time the “best” node in the network reaches a model whose predictive accuracy is comparable to that obtained using more computational effort in a non-distributed setting (the best node is identified as the one whose weights converge first).
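The consensus step described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that every node already shares the same fixed set of numeric features (the ILP feature-construction step is omitted entirely), that the local model is logistic regression (one example of a convex loss), and that the network is a ring. All names (`gossip_sgd`, `make_demo`, `disagreement`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_grad(w, x, y):
    # Gradient of the log-loss for a single example (x, y), with y in {0, 1}.
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def disagreement(weights):
    # Largest Euclidean distance between any two nodes' weight vectors.
    return max(
        math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
        for a in weights
        for b in weights
    )


def gossip_sgd(weights, data, neighbours, steps, rng, eta0=0.5):
    # Asynchronous pairwise gossip: at each tick one node wakes up, picks a
    # random neighbour, both take one local SGD step on their own data, and
    # then both replace their weights with the average of the pair.
    weights = [list(w) for w in weights]  # work on a copy
    for t in range(steps):
        i = rng.randrange(len(weights))
        j = rng.choice(neighbours[i])
        eta = eta0 / math.sqrt(1 + t)  # decaying step size (assumed schedule)
        for k in (i, j):
            x, y = rng.choice(data[k])
            g = logistic_grad(weights[k], x, y)
            weights[k] = [wk - eta * gk for wk, gk in zip(weights[k], g)]
        avg = [(a + b) / 2 for a, b in zip(weights[i], weights[j])]
        weights[i], weights[j] = list(avg), list(avg)
    return weights


def make_demo(n_nodes=8, n_examples=50, seed=0):
    # Synthetic setup: each node holds its own sample from one linear concept.
    rng = random.Random(seed)
    true_w = [1.5, -2.0, 0.5]  # hypothetical target concept
    data = []
    for _ in range(n_nodes):
        local = []
        for _ in range(n_examples):
            x = [rng.uniform(-1, 1) for _ in true_w]
            y = 1 if sum(a * b for a, b in zip(true_w, x)) > 0 else 0
            local.append((x, y))
        data.append(local)
    ring = [[(i - 1) % n_nodes, (i + 1) % n_nodes] for i in range(n_nodes)]
    init = [[rng.uniform(-1, 1) for _ in true_w] for _ in range(n_nodes)]
    return data, ring, init, rng


if __name__ == "__main__":
    data, ring, init, rng = make_demo()
    final = gossip_sgd(init, data, ring, steps=3000, rng=rng)
    print("disagreement before:", disagreement(init))
    print("disagreement after: ", disagreement(final))
```

In this toy setting the spread between the nodes' weight vectors shrinks as gossip proceeds, which is the behaviour the convergence result predicts for convex losses. The sketch says nothing about how quickly the full method converges when each node constructs its own relational features.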

## Keywords

Inductive logic programming, Consensus based learning, Stochastic gradient descent, Feature selection

## Notes

### Acknowledgements

H.D. is also an adjunct assistant professor at the Department of Computer Science at IIIT, Delhi and an Affiliated Member of the Institute of Data Sciences, Columbia University, NY. A.S. also holds visiting positions at the School of CSE, University of New South Wales, Sydney; and at the Department of Computer Science, Oxford University, Oxford.
