Abstract
Integrating logical reasoning and machine learning by approximating logical inference with differentiable operators is a widely used technique in the field of Neuro-Symbolic Learning. However, some differentiable operators could introduce significant biases during backpropagation, which can degrade the performance of Neuro-Symbolic systems. In this paper, we demonstrate that the loss functions derived from fuzzy logic operators commonly exhibit a bias, referred to as Implication Bias. To mitigate this bias, we propose a simple yet efficient method to transform the biased loss functions into Reduced Implication-bias Logic Loss (RILL). Empirical studies demonstrate that RILL outperforms the biased logic loss functions, especially when the knowledge base is incomplete or the supervised training data is insufficient.
Availability of data and material
All datasets are publicly available.
Code availability
Available at https://git.nju.edu.cn/Alkane/clion.git.
Notes
Owing to space limitations, this definition is not complete; a more rigorous definition can be found in Klement et al. (2013).
A detailed empirical study of other kinds of fuzzy operators is provided in the appendix.
References
Badreddine, Samy, d’Avila Garcez, Artur S., Serafini, Luciano, & Spranger, Michael. (2022). Logic tensor networks. Artificial Intelligence. https://doi.org/10.1016/j.artint.2021.103649
Cignoli, Roberto. (2007). The Algebras of Łukasiewicz Many-Valued Logic: A Historical Overview. https://doi.org/10.1007/978-3-540-75939-3_5
Clark, Keith L. (1978) Negation as failure. In Logic and data bases, pages 293–322.
Cohen, William W., Yang, Fan, & Mazaitis, Kathryn. (2020). Tensorlog: A probabilistic database implemented using deep-learning infrastructure. Journal of Artificial Intelligence Research. https://doi.org/10.1613/jair.1.11944
Dai, Wang-Zhou, Xu, Qiu-Ling, Yu, Yang, & Zhou, Zhi-Hua. (2019). Bridging machine learning and logical reasoning by abductive learning. In Conference on Neural Information Processing Systems.
Darwiche, Adnan. (2011). SDD: A new canonical representation of propositional knowledge bases. In International Joint Conference on Artificial Intelligence, pages 819–826.
d’Avila Garcez, Artur S., Gori, Marco, Lamb, Luís C., Serafini, Luciano, Spranger, Michael, & Tran, Son N. (2019). Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. Journal of Applied Logics.
Deng, Li. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine. https://doi.org/10.1109/MSP.2012.2211477
Enderton, Herbert B. (1972) A mathematical introduction to logic.
Fischer, Marc, Balunovic, Mislav, Drachsler-Cohen, Dana, Gehr, Timon, Zhang, Ce, & Vechev, Martin T. (2019). DL2: Training and querying neural networks with logic. In International Conference on Machine Learning.
Geirhos, Robert, Jacobsen, Jörn-Henrik, Michaelis, Claudio, Zemel, Richard S., Brendel, Wieland, Bethge, Matthias, & Wichmann, Felix A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence. https://doi.org/10.1038/s42256-020-00257-z
Gerla, Brunella, & Rovere, Massimo Dalla (2011). Nilpotent minimum fuzzy description logics. In European Society for Fuzzy Logic and Technology. https://doi.org/10.2991/eusflat.2011.127
Towell, Geoffrey G., & Shavlik, Jude W. (1994). Knowledge-based artificial neural networks. Artificial Intelligence Journal.
Giannini, Francesco, Marra, Giuseppe, Diligenti, Michelangelo, Maggini, Marco, & Gori, Marco. (2019). On the relation between loss functions and t-norms. In International Conference on Inductive Logic Programming. https://doi.org/10.1007/978-3-030-49210-6_4
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, & Sun, Jian. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/CVPR.2016.90
Hoernle, Nick, Karampatsis, Rafael-Michael, Belle, Vaishak, & Gal, Kobi. (2022). Multiplexnet: Towards fully satisfied logical constraints in neural networks. In AAAI Conference on Artificial Intelligence.
Klement, E.P., Mesiar, R., & Pap, E. (2013) Triangular Norms.
Krizhevsky, Alex, & Hinton, Geoffrey. (2009). Learning multiple layers of features from tiny images. Technical report.
Li, Tao., & Srikumar, Vivek (2019) Augmenting neural networks with first-order logic. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1028.
Maas, Andrew L., Hannun, Awni Y., & Ng, Andrew Y. (2013). Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning.
Manhaeve, Robin, Dumancic, Sebastijan, Kimmig, Angelika, Demeester, Thomas, & Raedt, Luc De. (2018). Deepproblog: Neural probabilistic logic programming. In Conference on Neural Information Processing Systems.
Marra, Giuseppe., Dumancic, Sebastijan., Manhaeve, Robin., & Raedt, Luc De (2021) From statistical relational to neural symbolic artificial intelligence. CoRR.
Müller, Rafael., Kornblith, Simon., & Hinton, Geoffrey E. (2019) When does label smoothing help? In Conference on Neural Information Processing Systems
Natarajan, Nagarajan., Dhillon, Inderjit S., Ravikumar, Pradeep., & Tewari, Ambuj (2013) Learning with noisy labels. In Conference on Neural Information Processing Systems.
Paad, Akbar. (2016). Relation between (fuzzy) Gödel ideals and (fuzzy) Boolean ideals in BL-algebras. Discussiones Mathematicae General Algebra and Applications.
Phoungphol, Piyaphol, Zhang, Yanqing, & Zhao, Yichuan. (2012). Robust multiclass classification for learning from imbalanced biomedical data. Tsinghua Science and Technology, 6, 619–628.
Raedt, Luc De., Dumancic, Sebastijan., Manhaeve, Robin., & Marra, Giuseppe (2020) From statistical relational to neuro-symbolic artificial intelligence. In International Joint Conference on Artificial Intelligence. https://doi.org/10.24963/ijcai.2020/688.
Reiter, Raymond. (1978). On closed world data bases. https://doi.org/10.1007/978-1-4684-3384-5_3
Reiter, Raymond. (1980). A logic for default reasoning. Artificial Intelligence. https://doi.org/10.1016/0004-3702(80)90014-4
Roychowdhury, Soumali, Diligenti, Michelangelo, & Gori, Marco. (2021). Regularizing deep networks with prior knowledge: A constraint-based approach. Knowledge-Based System. https://doi.org/10.1016/j.knosys.2021.106989
van Krieken, Emile, Acar, Erman, & van Harmelen, Frank. (2022). Analyzing differentiable fuzzy logic operators. Artificial Intelligence Journal. https://doi.org/10.1016/j.artint.2021.103602
Xiao, Han, Rasul, Kashif, & Vollgraf, Roland. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. CoRR.
Xu, Jingyi., Zhang, Zilu., Friedman, Tal., Liang, Yitao., & Broeck, Guy Van den (2018) A semantic loss function for deep learning with symbolic knowledge. In International Conference on Machine Learning
Xu, E., Yu, Z., Li, N., Cui, H., Yao, L., & Guo, B. (2023). Quantifying predictability of sequential recommendation via logical constraints. Frontiers of Computer Science, 17, https://doi.org/10.1007/s11704-022-2223-1.
Yang, Zhun., Lee, Joohyung., & Park, Chiyoun (2022) Injecting logical constraints into neural networks via straight-through estimators. In International Conference on Machine Learning
Zagoruyko, Sergey., & Komodakis, Nikos (2016) Wide residual networks. In British Machine Vision Conference
Zhou, Zhi-Hua. (2019). Abductive learning: towards bridging machine learning and logical reasoning. Science China Information Sciences. https://doi.org/10.1007/s11432-018-9801-4
Funding
This research was supported by NSFC (62076121, 61921006), and Major Program of Hubei Province (2023BAA024).
Author information
Authors and Affiliations
Contributions
H-YH conceived the central idea of this paper and contributed to the writing and execution of the main experiments. W-ZD contributed to the enhancement of the RILL approach and the refinement of this paper. ML provided valuable feedback, suggestions, and editing services for this manuscript.
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Vu Nguyen, Dani Yogatama.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Discussion with other kinds of fuzzy operators
In this section, we investigate other kinds of fuzzy operators and verify, both analytically and empirically, that they also exhibit implication bias. When selecting fuzzy operators for NeSy, it is crucial to consider their smoothness and ease of optimization; we therefore concentrate on commonly used fuzzy operators that possess these desirable properties, and fuzzy operators that do not have a smooth gradient are disregarded in our analysis.
A.1 Analysis
Sigmoidal van Krieken et al. (2022) proposed an operator that smooths the Reichenbach operator with a sigmoid function. This implication likelihood function is defined as follows:
$$\sigma _I(p,q) = \frac{\left( 1+e^{-s\left( 1+b_{0}\right) }\right) \left[ \left( 1+e^{-s\cdot b_0}\right) \sigma \left( s \cdot \left( I(p,q)+b_{0}\right) \right) -1\right] }{e^{-b_{0} s}-e^{-s\left( 1+b_{0}\right) }},$$
where \(s>0\), \(b_0\in {\mathbb {R}}\), and \(\sigma (x) = \frac{1}{1+e^{-x}}\) denotes the sigmoid function. Substituting \(d=\frac{1+e^{-s\left( 1+b_{0}\right) }}{e^{-b_{0} s}-e^{-s\left( 1+b_{0}\right) }}, h =(1+e^{-s\cdot b_0}),f=\sigma \left( s \cdot \left( I(p,q)+b_{0}\right) \right) \), we find
$$\sigma _I(p,q) = d\left( h\cdot f-1\right) \quad \text {and hence}\quad \frac{\partial \sigma _I(p,q)}{\partial p} = d\cdot h\cdot s\cdot f(1-f)\cdot \frac{\partial I(p,q)}{\partial p},$$
where the factor \(d\cdot h\cdot s\cdot f(1-f)\) is strictly positive. That is to say, when I(p, q) is \(\delta \)-Confidence Monotonic, \(\sigma _I(p,q)\) will be \(\delta \)-Confidence Monotonic too. This means the logic loss derived from \(\sigma _I\) will also become implication biased.
Łukasiewicz The Łukasiewicz implication likelihood (Cignoli, 2007) is defined as follows:
$$I_{LK}(p,q) = \min \left( 1,\, 1-p+q\right) .$$
It is easy to calculate the gradient of this likelihood, \(\frac{\partial I_{LK}}{\partial p} = -1 \cdot {\mathbb {I}}[p>q]\), which shows that this operator is also implication biased.
Gödel The Gödel implication likelihood (Paad, 2016) is defined as follows:
$$I_{G}(p,q) = \max \left( 1-p,\, q\right) .$$
Again, the gradient of this likelihood, \(\frac{\partial I_{G}}{\partial p} = -1 \cdot {\mathbb {I}}[p+q<1]\), indicates that this operator is implication biased.
Nilpotent The nilpotent implication likelihood (Gerla & Rovere, 2011) is defined as follows:
$$I_{N}(p,q) = \begin{cases} 1, & p \le q,\\ \max (1-p,\, q), & \text {otherwise.} \end{cases}$$
Again, the gradient of this likelihood, \( \frac{\partial I_{N}}{\partial p} = -1 \cdot {\mathbb {I}}[(p+q<1) \ \& \ (p >q)]\), indicates that this operator is implication biased.
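To make this bias concrete, the following sketch (ours, for illustration only; it is not part of the released code) uses PyTorch autograd to evaluate the gradients of the three implication likelihoods above at a point with \(p>q\) and \(p+q<1\). In every case, the gradient of the logic loss \(1-I(p,q)\) with respect to the antecedent p is positive, so minimizing the loss drives the antecedent towards 0, which is exactly the implication bias.

```python
import torch

def lukasiewicz(p, q):
    # I_LK(p, q) = min(1, 1 - p + q)
    return torch.clamp(1 - p + q, max=1.0)

def goedel(p, q):
    # I_G(p, q) = max(1 - p, q)
    return torch.maximum(1 - p, q)

def nilpotent(p, q):
    # I_N(p, q) = 1 if p <= q, else max(1 - p, q)
    return torch.where(p <= q, torch.ones_like(p), torch.maximum(1 - p, q))

p = torch.tensor(0.6, requires_grad=True)  # antecedent confidence
q = torch.tensor(0.2, requires_grad=True)  # consequent confidence

for name, imp in [("Lukasiewicz", lukasiewicz),
                  ("Goedel", goedel),
                  ("Nilpotent", nilpotent)]:
    if p.grad is not None:
        p.grad.zero_()
    if q.grad is not None:
        q.grad.zero_()
    loss = 1 - imp(p, q)  # logic loss derived from the implication likelihood
    loss.backward()
    # dL/dp = +1 for all three operators at (0.6, 0.2):
    # gradient descent lowers p, i.e. it prefers a false antecedent.
    print(f"{name}: dL/dp = {p.grad.item():+.1f}, dL/dq = {q.grad.item():+.1f}")
```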
A.2 Empirical study
In this section, we present empirical results for the operators analyzed above, with particular emphasis on the incomplete knowledge base scenario (Add-MNIST) and the insufficient labeled data scenario (Add-CIFAR10). All experiments follow the same settings used in Sect. 6.1.1.
As shown in Fig. 7, implication bias significantly harms the performance of the model, especially when the knowledge base is incomplete or the amount of supervised information is not sufficient, which further supports our above analysis.
Appendix B: Discussion about Clark’s completion
RILL and Clark’s completion both aim to reduce the uncertainty of negative information. However, RILL does not alter the information in the knowledge base; instead, it reduces the importance of weak samples, causing the model to pay more attention to samples that are more relevant to the given rule.
Surprisingly, explicitly applying Clark’s completion in a NeSy system may not be helpful. There are two reasons behind this claim.
First, Clark’s completion explicitly replaces \(\rightarrow \) with \(\leftrightarrow \). For example, the rules
$$A_1\rightarrow A,\quad A_2\rightarrow A,\quad \cdots ,\quad A_n\rightarrow A$$
will be replaced by \(A\leftrightarrow (A_1\vee A_2\vee \cdots \vee A_n)\).
However, fuzzy operators are unsuitable for approximating a rule with many atoms (Marra et al., 2021). An example is the n-ary Łukasiewicz strong disjunction \(F_{\vee }(x_1,\cdots ,x_n) = \min (1,x_1+\cdots +x_n)\): even if every \(x_i\) is very small, this approximation of the disjunction can still yield a value near 1 once n is large. Because Clark’s completion increases the number of atoms in each logical rule, the knowledge base may suffer from this problem after completion.
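The saturation effect is easy to reproduce numerically; the short sketch below (illustrative only, with an arbitrarily chosen truth value of 0.02 for every disjunct) evaluates the n-ary Łukasiewicz strong disjunction on uniformly small inputs.

```python
def lukasiewicz_strong_disjunction(xs):
    # F_or(x_1, ..., x_n) = min(1, x_1 + ... + x_n)
    return min(1.0, sum(xs))

# Every disjunct is almost false (0.02), yet the fuzzy approximation of the
# disjunction saturates to 1 once the rule contains enough atoms.
for n in (5, 20, 100):
    print(n, lukasiewicz_strong_disjunction([0.02] * n))
# prints approximately: 5 0.1, 20 0.4, 100 1.0
```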
Second, if the knowledge base of a NeSy system is incomplete, replacing \(\rightarrow \) with \(\leftrightarrow \) changes the information in the knowledge base and may introduce wrong information. In logic programming, Clark’s completion does not affect the soundness of the system, but in a data-driven NeSy setting this guarantee no longer holds.
Here we adopt the same experimental setting as the incomplete knowledge base case in Sect. 6.1.1 to evaluate the performance of Clark’s completion. As depicted in Fig. 8, when the knowledge base becomes incomplete, completing it does not improve the model’s performance, because the completion introduces wrong information.
Appendix C: Sensitivity analysis
Both \(\text {RILL}_{hinge}\) and \(\text {RILL}_{l2+hinge}\) contain a hyper-parameter \(\epsilon \) in their definition. In this section, we investigate the sensitivity of \(\epsilon \) and its impact on the model’s accuracy in the Add-MNIST experiment when the knowledge base incompleteness is 40%. Figure 9 displays the relationship between \(\epsilon \) and the model’s accuracy.
The results indicate that when the threshold \(\epsilon \) decreases below a certain value (0.001 in Fig. 9), the model’s performance drops, suggesting the existence of a sensitive region that is likely related to how weak samples are handled when addressing implication bias. When the threshold \(\epsilon \) is above this value, the performance of RILL is stable but slightly decreasing; this decrease may be due to the increased number of weak samples, which results in the loss of some useful information.
Appendix D: Details of experiments
In this section, we will provide more details of our experiments.
Implementation of Semantic Loss The formal definition of Semantic Loss requires converting the knowledge base from CNF (conjunctive normal form) into an SDD (sentential decision diagram), which can become impractical for knowledge bases with a large number of predicates and complex rules due to the heavy computational burden. To address this issue, we adopt an alternative approach in which each rule in the knowledge base is converted separately into a WMC (weighted model counting) model, and the resulting models are combined at the end. Although explicit repetitive terms are reduced during the combining phase, detecting implicit repetitive terms is computationally expensive, so they are left unchanged. This approach enables us to implement Semantic Loss efficiently while still accounting for the complexity of the knowledge base.
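As an illustration of this per-rule strategy, the sketch below (a minimal sketch, not our actual implementation; the clause representation, variable names, and the brute-force enumeration are simplifying assumptions) computes the weighted model count of each clause over only the variables it mentions and sums the per-rule negative log-probabilities:

```python
import itertools
import torch

def clause_wmc(probs, clause):
    """Weighted model count of a single clause.

    probs  : dict var -> Bernoulli probability (torch scalar) predicted by the network
    clause : list of (var, polarity) literals, interpreted as a disjunction
    Enumerates only the variables appearing in the clause, so the cost is
    exponential in the clause length rather than in the whole knowledge base.
    """
    vars_ = sorted({v for v, _ in clause})
    total = torch.tensor(0.0)
    for assignment in itertools.product([False, True], repeat=len(vars_)):
        world = dict(zip(vars_, assignment))
        if not any(world[v] == pol for v, pol in clause):
            continue  # clause falsified in this world: contributes nothing
        weight = torch.tensor(1.0)
        for v in vars_:
            weight = weight * (probs[v] if world[v] else 1 - probs[v])
        total = total + weight
    return total

def semantic_loss(probs, rules):
    # Combine rules by summing per-rule negative log WMC (implicit repetitive
    # terms across rules are not merged, as described above).
    return sum(-torch.log(clause_wmc(probs, r) + 1e-12) for r in rules)

# Toy usage: predicates a, b, c with predicted probabilities, knowledge base
# { a -> b , b or c } written in clause form.
probs = {"a": torch.tensor(0.9, requires_grad=True),
         "b": torch.tensor(0.3, requires_grad=True),
         "c": torch.tensor(0.2, requires_grad=True)}
rules = [[("a", False), ("b", True)],   # not-a or b  (a -> b)
         [("b", True), ("c", True)]]    # b or c
loss = semantic_loss(probs, rules)
loss.backward()
```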
However, even with this optimization, the cost of using Semantic Loss is still much higher than that of using RILL, with a cost that is around three times higher. As a result, Semantic Loss may not be practical to use in many cases, particularly when dealing with large or complex knowledge bases.
Details of Task 1 The backbone for both the MNIST and FashionMNIST datasets is a three-layer multilayer perceptron (MLP) with layer widths [256, 512, 10] and ReLU activations (Maas et al., 2013), whereas the backbone for CIFAR10 is ResNet9 (He et al., 2016). The value of \(\lambda \) for both the fuzzy and RILL logic losses is 0.7. For Semantic Loss, we sample \(\lambda \) from \(\{0.001,0.005,0.01,0.05,0.1,0.5\}\) and choose the best value, which is 0.5 in this experiment. The learning rate is set to 0.0001 with a decay rate of 0.7; the learning rate scheduler is StepLR with a decay step of 60. The optimizer is AdamW, and the weight decay rate is set to 5e-4.
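For concreteness, the Task 1 configuration above corresponds roughly to the following PyTorch setup (a sketch under the stated hyper-parameters; the 784-dimensional input for MNIST and the way the two loss terms are combined are assumptions, not details taken from the paper):

```python
import torch
import torch.nn as nn

# Three-layer MLP backbone for MNIST / FashionMNIST, layer widths [256, 512, 10].
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.7)

lambda_logic = 0.7  # weight of the fuzzy / RILL logic loss (0.5 for Semantic Loss)

# Schematic training objective per batch (assumed combination):
# loss = cross_entropy(model(x_labeled), y) + lambda_logic * logic_loss(model(x_unlabeled))
```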
Details of Task 2 For this task, we select WideResNet-28-8 (Zagoruyko & Komodakis, 2016) as the backbone architecture, with two classification heads: one for class classification and the other for super-class classification. The value of \(\lambda \) for both the fuzzy and RILL logic losses is 0.002, and for Semantic Loss we keep 0.5 as the default value. The learning rate is set to 0.005, with a decay rate of 0.9 and a decay step of 45. The learning rate scheduler is StepWithWarmUp, with the warm-up period set to 5 epochs. The optimizer is kept at its default setting, with momentum set to 0.9.
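A corresponding sketch of the Task 2 model (illustrative only; the feature dimension, class counts, and the use of SGD are assumptions, since the text above only specifies the momentum value and learning-rate schedule) wraps a shared backbone with two classification heads:

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Shared backbone with separate class and super-class heads."""

    def __init__(self, backbone, feature_dim, num_classes, num_superclasses):
        super().__init__()
        self.backbone = backbone            # e.g. a WideResNet-28-8 trunk
        self.class_head = nn.Linear(feature_dim, num_classes)
        self.superclass_head = nn.Linear(feature_dim, num_superclasses)

    def forward(self, x):
        features = self.backbone(x)         # (batch, feature_dim)
        return self.class_head(features), self.superclass_head(features)

# Optimizer sketch: momentum 0.9 and lr 0.005 come from the text; using SGD and
# the placeholder dimensions below is an assumption.
# model = TwoHeadClassifier(backbone, feature_dim=512, num_classes=100, num_superclasses=20)
# optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
```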
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
He, HY., Dai, WZ. & Li, M. Reduced implication-bias logic loss for neuro-symbolic learning. Mach Learn 113, 3357–3377 (2024). https://doi.org/10.1007/s10994-023-06436-4
DOI: https://doi.org/10.1007/s10994-023-06436-4