## Abstract

Integrating logical reasoning and machine learning by approximating logical inference with differentiable operators is a widely used technique in the field of Neuro-Symbolic Learning. However, some differentiable operators could introduce significant biases during backpropagation, which can degrade the performance of Neuro-Symbolic systems. In this paper, we demonstrate that the loss functions derived from fuzzy logic operators commonly exhibit a bias, referred to as *Implication Bias*. To mitigate this bias, we propose a simple yet efficient method to transform the biased loss functions into *Reduced Implication-bias Logic Loss (RILL)*. Empirical studies demonstrate that RILL outperforms the biased logic loss functions, especially when the knowledge base is incomplete or the supervised training data is insufficient.

## Availability of data and material

All datasets are publicly available.

## Code availability

Available at https://git.nju.edu.cn/Alkane/clion.git.

## Notes

Owing to space limits, this definition is not complete; a more rigorous definition can be found in Klement et al. (2013).

A detailed empirical study on other kinds of fuzzy operators can be found in the appendix.

## References

Badreddine, Samy, d’Avila Garcez, Artur S., Serafini, Luciano, & Spranger, Michael. (2022). Logic tensor networks. *Artificial Intelligence Journal*. https://doi.org/10.1016/j.artint.2021.103649

Cignoli, Roberto. (2007). *The Algebras of Łukasiewicz Many-Valued Logic: A Historical Overview.* https://doi.org/10.1007/978-3-540-75939-3_5

Clark, Keith L. (1978). Negation as failure. In *Logic and data bases*, pages 293–322.

Cohen, William W., Yang, Fan, & Mazaitis, Kathryn. (2020). Tensorlog: A probabilistic database implemented using deep-learning infrastructure. *Journal of Artificial Intelligence Research*. https://doi.org/10.1613/jair.1.11944

Dai, Wang-Zhou, Xu, Qiu-Ling, Yu, Yang, & Zhou, Zhi-Hua. (2019). Bridging machine learning and logical reasoning by abductive learning. In *Conference on Neural Information Processing Systems*.

Darwiche, Adnan. (2011). SDD: A new canonical representation of propositional knowledge bases. In *International Joint Conference on Artificial Intelligence*, pages 819–826.

d’Avila Garcez, Artur S., Gori, Marco, Lamb, Luís C., Serafini, Luciano, Spranger, Michael, & Tran, Son N. (2019). Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning. *Journal of Applied Logics*.

Deng, Li. (2012). The MNIST database of handwritten digit images for machine learning research. *IEEE Signal Processing Magazine*. https://doi.org/10.1109/MSP.2012.2211477

Enderton, Herbert B. (1972). *A mathematical introduction to logic*.

Fischer, Marc, Balunovic, Mislav, Drachsler-Cohen, Dana, Gehr, Timon, Zhang, Ce, & Vechev, Martin T. (2019). DL2: training and querying neural networks with logic. In *International Conference on Machine Learning*.

Geirhos, Robert, Jacobsen, Jörn-Henrik, Michaelis, Claudio, Zemel, Richard S., Brendel, Wieland, Bethge, Matthias, & Wichmann, Felix A. (2020). Shortcut learning in deep neural networks. *Nat. Mach. Intell.* https://doi.org/10.1038/s42256-020-00257-z

Gerla, Brunella, & Rovere, Massimo Dalla. (2011). Nilpotent minimum fuzzy description logics. In *European Society for Fuzzy Logic and Technology*. https://doi.org/10.2991/eusflat.2011.127

Towell, Geoffrey G., & Shavlik, Jude W. (1994). Knowledge-based artificial neural networks. *Artificial Intelligence Journal*.

Giannini, Francesco, Marra, Giuseppe, Diligenti, Michelangelo, Maggini, Marco, & Gori, Marco. (2019). On the relation between loss functions and t-norms. In *International Conference on Inductive Logic Programming*. https://doi.org/10.1007/978-3-030-49210-6_4

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, & Sun, Jian. (2016). Deep residual learning for image recognition. In *IEEE Conference on Computer Vision and Pattern Recognition*. https://doi.org/10.1109/CVPR.2016.90

Hoernle, Nick, Karampatsis, Rafael-Michael, Belle, Vaishak, & Gal, Kobi. (2022). Multiplexnet: Towards fully satisfied logical constraints in neural networks. In *AAAI Conference on Artificial Intelligence*.

Klement, E.P., Mesiar, R., & Pap, E. (2013). *Triangular Norms*.

Krizhevsky, Alex, Hinton, Geoffrey, et al. (2009). Learning multiple layers of features from tiny images. *Technical Report TR 2009*.

Li, Tao, & Srikumar, Vivek. (2019). Augmenting neural networks with first-order logic. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, *Annual Meeting of the Association for Computational Linguistics*. https://doi.org/10.18653/v1/p19-1028

Maas, Andrew L., Hannun, Awni Y., Ng, Andrew Y., et al. (2013). Rectifier nonlinearities improve neural network acoustic models. In *International Conference on Machine Learning*.

Manhaeve, Robin, Dumancic, Sebastijan, Kimmig, Angelika, Demeester, Thomas, & Raedt, Luc De. (2018). Deepproblog: Neural probabilistic logic programming. In *Conference on Neural Information Processing Systems*.

Marra, Giuseppe, Dumancic, Sebastijan, Manhaeve, Robin, & Raedt, Luc De. (2021). From statistical relational to neural symbolic artificial intelligence. *CoRR*.

Müller, Rafael, Kornblith, Simon, & Hinton, Geoffrey E. (2019). When does label smoothing help? In *Conference on Neural Information Processing Systems*.

Natarajan, Nagarajan, Dhillon, Inderjit S., Ravikumar, Pradeep, & Tewari, Ambuj. (2013). Learning with noisy labels. In *Conference on Neural Information Processing Systems*.

Paad, Akbar. (2016). Relation between (fuzzy) Gödel ideals and (fuzzy) Boolean ideals in BL-algebras. *Discussiones Mathematicae General Algebra and Applications*.

Phoungphol, Piyaphol, Zhang, Yanqing, & Zhao, Yichuan. (2012). Robust multiclass classification for learning from imbalanced biomedical data. *Tsinghua Science and Technology*, *6*, 619–628.

Raedt, Luc De, Dumancic, Sebastijan, Manhaeve, Robin, & Marra, Giuseppe. (2020). From statistical relational to neuro-symbolic artificial intelligence. In *International Joint Conference on Artificial Intelligence*. https://doi.org/10.24963/ijcai.2020/688

Reiter, Raymond. (1978). *On Closed World Data Bases.* https://doi.org/10.1007/978-1-4684-3384-5_3

Reiter, Raymond. (1980). A logic for default reasoning. *AI*. https://doi.org/10.1016/0004-3702(80)90014-4

Roychowdhury, Soumali, Diligenti, Michelangelo, & Gori, Marco. (2021). Regularizing deep networks with prior knowledge: A constraint-based approach. *Knowledge-Based Systems*. https://doi.org/10.1016/j.knosys.2021.106989

van Krieken, Emile, Acar, Erman, & van Harmelen, Frank. (2022). Analyzing differentiable fuzzy logic operators. *Artificial Intelligence Journal*. https://doi.org/10.1016/j.artint.2021.103602

Xiao, Han, Rasul, Kashif, & Vollgraf, Roland. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning. *CoRR*.

Xu, Jingyi, Zhang, Zilu, Friedman, Tal, Liang, Yitao, & Broeck, Guy Van den. (2018). A semantic loss function for deep learning with symbolic knowledge. In *International Conference on Machine Learning*.

Xu, E., Yu, Z., Li, N., Cui, H., Yao, L., & Guo, B. (2023). Quantifying predictability of sequential recommendation via logical constraints. *Frontiers of Computer Science*, *17*. https://doi.org/10.1007/s11704-022-2223-1

Yang, Zhun, Lee, Joohyung, & Park, Chiyoun. (2022). Injecting logical constraints into neural networks via straight-through estimators. In *International Conference on Machine Learning*.

Zagoruyko, Sergey, & Komodakis, Nikos. (2016). Wide residual networks. In *British Machine Vision Conference*.

Zhou, Zhi-Hua. (2019). Abductive learning: towards bridging machine learning and logical reasoning. *Science China Information Sciences*. https://doi.org/10.1007/s11432-018-9801-4

## Funding

This research was supported by NSFC (62076121, 61921006), and Major Program of Hubei Province (2023BAA024).

## Author information


### Contributions

H-YH conceived the central idea of this paper and contributed to the writing and execution of the main experiments. W-ZD contributed to the enhancement of the RILL approach and the refinement of this paper. ML provided valuable feedback, suggestions, and editing services for this manuscript.


## Ethics declarations

### Conflict of interest

Not applicable.

### Ethical approval

Not applicable.

### Consent to participate

Not applicable.

### Consent for publication

Not applicable.

## Additional information

Editors: Vu Nguyen, Dani Yogatama.

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Appendices

### Appendix A: Discussion of other kinds of fuzzy operators

In this section, we investigate other types of fuzzy operators, show that they remain implication biased, and validate this analysis experimentally. When selecting fuzzy operators for NeSy, it is crucial to consider their smoothness and ease of optimization. We therefore concentrate on commonly used fuzzy operators that possess these properties and disregard operators that do not have a smooth gradient.

### A.1 Analysis

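**Sigmoidal** van Krieken et al. (2022) proposed an operator that smooths the Reichenbach operator with a sigmoid function. This implication likelihood function is defined as follows:

\[\sigma _I(p,q) = \frac{1+e^{-s\left( 1+b_{0}\right) }}{e^{-b_{0} s}-e^{-s\left( 1+b_{0}\right) }} \cdot \left( \left( 1+e^{-s\cdot b_0}\right) \cdot \sigma \left( s \cdot \left( I(p,q)+b_{0}\right) \right) - 1\right) \]

where \(s>0,b_0\in {\mathbb {R}}\), and \(\sigma (x) = \frac{1}{1+e^{-x}}\) denotes the sigmoid function. Substituting \(d=\frac{1+e^{-s\left( 1+b_{0}\right) }}{e^{-b_{0} s}-e^{-s\left( 1+b_{0}\right) }}, h =(1+e^{-s\cdot b_0}),f=\sigma \left( s \cdot \left( I(p,q)+b_{0}\right) \right) \), we find:

\[\sigma _I(p,q) = d\cdot (h\cdot f - 1), \qquad \frac{\partial \sigma _I(p,q)}{\partial p} = d\cdot h\cdot s\cdot f\cdot (1-f)\cdot \frac{\partial I(p,q)}{\partial p}\]

Since \(d\), \(h\), \(s\) and \(f(1-f)\) are all positive, the gradient of \(\sigma _I\) is a positive multiple of the gradient of \(I\). That is to say, when *I*(*p*, *q*) is \(\delta \)-Confidence Monotonic, \(\sigma _I(p,q)\) is \(\delta \)-Confidence Monotonic too. This means the logic loss derived from \(\sigma _I\) is also implication biased.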
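The behaviour of the sigmoidal operator can be checked numerically. The sketch below assumes the form \(\sigma _I = d\cdot (h\cdot f-1)\) built from the substitutions \(d, h, f\) above, the standard logistic sigmoid, and the Reichenbach implication \(1-p+pq\) as the base operator; the parameter values `s=9.0` and `b0=-0.5` are illustrative choices, not values taken from this paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def reichenbach(p: float, q: float) -> float:
    """Reichenbach implication, used here as the base operator I(p, q)."""
    return 1.0 - p + p * q


def sigmoidal(p: float, q: float, s: float = 9.0, b0: float = -0.5) -> float:
    """Sigmoidal implication d * (h * f - 1); s and b0 are illustrative."""
    d = (1.0 + math.exp(-s * (1.0 + b0))) / (
        math.exp(-b0 * s) - math.exp(-s * (1.0 + b0))
    )
    h = 1.0 + math.exp(-s * b0)
    f = sigmoid(s * (reichenbach(p, q) + b0))
    return d * (h * f - 1.0)


# The rescaling by d and h maps I = 0 to 0 and I = 1 to 1, and the
# operator stays decreasing in p whenever the base operator is.
values = [sigmoidal(p, 0.2) for p in (0.1, 0.5, 0.9)]
```

Because the sigmoid is strictly increasing and `d`, `h`, `s` are positive, the ordering of `values` follows the ordering of the base operator, which is the monotonicity property the argument relies on.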

**Łukasiewicz** The Łukasiewicz implication likelihood (Cignoli, 2007) is defined as follows:

\[I_{LK}(p,q) = \min (1, 1-p+q)\]

The gradient of this likelihood is easy to calculate: \(\frac{\partial I_{LK}}{\partial p} = -1 \cdot {\mathbb {I}}[p>q]\). It follows that this operator is also implication biased.

**Gödel** The Gödel implication likelihood (Paad, 2016) is defined as follows:

\[I_{G}(p,q) = \max (1-p, q)\]

Again, the gradient of this likelihood, \(\frac{\partial I_{G}}{\partial p} = -1 \cdot {\mathbb {I}}[p+q<1]\), indicates that this operator is implication biased.

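**Nilpotent** The Nilpotent implication likelihood (Gerla & Rovere, 2011) is defined as follows:

\[I_{N}(p,q) = {\left\{ \begin{array}{ll} 1, &{} p \le q \\ \max (1-p, q), &{} \text {otherwise} \end{array}\right. }\]

Again, the gradient of this likelihood, \( \frac{\partial I_{N}}{\partial p} = -1 \cdot {\mathbb {I}}[(p+q<1) \& (p >q)]\), indicates that this operator is implication biased.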
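The three gradients stated above can be checked with finite differences. The sketch below assumes the operator forms \(\min (1,1-p+q)\) for Łukasiewicz, \(\max (1-p,q)\) for Gödel, and the Nilpotent form that equals 1 when \(p\le q\) and \(\max (1-p,q)\) otherwise; these forms are assumptions chosen to agree with the stated gradients, and the helper names are hypothetical.

```python
def lukasiewicz(p: float, q: float) -> float:
    return min(1.0, 1.0 - p + q)


def godel(p: float, q: float) -> float:
    # Assumed form, consistent with the gradient -1 * [p + q < 1].
    return max(1.0 - p, q)


def nilpotent(p: float, q: float) -> float:
    return 1.0 if p <= q else max(1.0 - p, q)


def grad_p(fn, p: float, q: float, eps: float = 1e-6) -> float:
    """Central finite difference of fn with respect to the premise p."""
    return (fn(p + eps, q) - fn(p - eps, q)) / (2.0 * eps)


# Evaluate away from the kinks, where the operators are not differentiable.
checks = {
    "lukasiewicz, p > q": grad_p(lukasiewicz, 0.8, 0.3),
    "lukasiewicz, p < q": grad_p(lukasiewicz, 0.2, 0.6),
    "godel, p + q < 1": grad_p(godel, 0.3, 0.4),
    "godel, p + q > 1": grad_p(godel, 0.8, 0.6),
    "nilpotent, p + q < 1 and p > q": grad_p(nilpotent, 0.5, 0.2),
    "nilpotent, p + q > 1 and p > q": grad_p(nilpotent, 0.9, 0.4),
    "nilpotent, p < q": grad_p(nilpotent, 0.2, 0.5),
}
```

In every region where the gradient is nonzero it equals \(-1\): gradient descent on the corresponding loss can only raise the likelihood by pushing the premise \(p\) down, which is the pattern the analysis identifies as implication bias.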

### A.2 Empirical study

In this section, we present the empirical results of the above-analyzed operators for validation, with a particular emphasis on incomplete knowledge bases (especially Add-MNIST) and insufficient labeled data (especially Add-CIFAR10) scenarios. All experiments follow the same settings used in Sect. 6.1.1.

As shown in Fig. 7, implication bias significantly harms the performance of the model, especially when the knowledge base is incomplete or the amount of supervised information is not sufficient, which further supports our above analysis.

### Appendix B: Discussion about Clark’s completion

RILL and Clark’s completion both aim to reduce the uncertainty of negative information. However, RILL does not alter the information in the knowledge base; instead, it reduces the importance of weak samples, causing the model to pay more attention to samples that are more relevant to the given rule.

Surprisingly, explicitly applying Clark’s completion in a NeSy system may not be helpful. There are two reasons behind this claim.

First, an explicit Clark’s completion needs to replace \(\rightarrow \) with \(\leftrightarrow \). For example, the rules

\(A_1\rightarrow A,\; A_2\rightarrow A,\; \ldots ,\; A_n\rightarrow A\)

will be replaced by \(A\leftrightarrow (A_1\vee A_2\vee \cdots \vee A_n)\).

However, fuzzy operators are unsuitable for approximating a rule with many atoms (Marra et al., 2021). An example is the n-ary Łukasiewicz strong disjunction \(F_{\vee }(x_1,\cdots ,x_n) = \min (1,x_1+\cdots +x_n)\): even when every \(x_i\) is very small, their sum can reach 1 once \(n\) is large, so this approximation of disjunction gives a value near 1. Because Clark’s completion increases the number of atoms in a logical rule, the knowledge base may suffer from this problem after completion.
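A short numerical illustration of this saturation effect follows. The function name `lukasiewicz_or` and the chosen truth values are hypothetical; only the formula \(\min (1,x_1+\cdots +x_n)\) comes from the text.

```python
def lukasiewicz_or(*xs: float) -> float:
    """n-ary Lukasiewicz strong disjunction: min(1, x_1 + ... + x_n)."""
    return min(1.0, sum(xs))


# Two weakly supported atoms give a weakly supported disjunction.
few_atoms = lukasiewicz_or(0.05, 0.05)

# Fifty atoms with the same weak support saturate the disjunction at 1,
# although no single atom is anywhere near true.
many_atoms = lukasiewicz_or(*[0.05] * 50)

# For comparison, the largest single truth value is still only 0.05.
strongest_atom = max([0.05] * 50)
```

After completion, a rule head with many bodies therefore looks fully satisfied under this approximation even when every body has a truth value of only 0.05.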

Second, in a NeSy system, if the knowledge base is incomplete, replacing \(\rightarrow \) with \(\leftrightarrow \) changes the information in the knowledge base, which may introduce wrong information. In logic programming, Clark’s completion does not change the soundness of the system, whereas in the NeSy setting with a data-driven approach, this is not guaranteed.

Here we adopt the same experimental setting as the incomplete knowledge base case in Sect. 6.1.1 to validate the performance of Clark’s completion. As depicted in Fig. 8, when the knowledge base becomes incomplete, completing it does not improve the model’s performance because completion introduces wrong information.

### Appendix C: Sensitivity analysis

Both \(\text {RILL}_{hinge}\) and \(\text {RILL}_{l2+hinge}\) contain a hyper-parameter \(\epsilon \) in their definition. In this section, we investigate the sensitivity of \(\epsilon \) and its impact on the model’s accuracy in the Add-MNIST experiment when the knowledge base incompleteness is 40%. Figure 9 displays the relationship between \(\epsilon \) and the model’s accuracy.

The results indicate that when the threshold \(\epsilon \) decreases to a certain value (in Fig. 9, it is 0.001), the model’s performance drops, suggesting the existence of a sensitive region that could be related to weak samples in addressing the problem of implication bias. When the threshold \(\epsilon \) is above this value, the performance of RILL is stable but slightly decreasing. This decrease may be due to the increased number of weak samples, which results in the loss of some useful information.

### Appendix D: Details of experiments

In this section, we will provide more details of our experiments.

**Implementation of Semantic Loss** The formal definition of Semantic Loss requires converting a CNF (conjunctive normal form) formula into an SDD (sentential decision diagram), which can become impractical for knowledge bases with a large number of predicates and complex rules because of the heavy computational burden. To address this issue, we adopt an alternative approach in which each rule in the knowledge base is converted separately into a WMC (weighted model counting) model, and the models are combined at the end. Explicit repetitive terms are reduced during the combining phase, but detecting implicit repetitive terms is computationally expensive, so they remain unchanged. This approach enables us to implement Semantic Loss efficiently while still accounting for the complexity of the knowledge base.

However, even with this optimization, the cost of using Semantic Loss is still around three times that of using RILL. As a result, Semantic Loss may not be practical in many cases, particularly when dealing with large or complex knowledge bases.
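The per-rule idea can be sketched with brute-force weighted model counting. This is a minimal illustration and not the authors' implementation: exhaustive enumeration stands in for SDD compilation, rules are represented as hypothetical Python predicates over truth assignments, and summing the per-rule losses is an assumed way of combining rules that performs no reduction of repeated terms.

```python
import itertools
import math


def wmc(rule, probs):
    """Weighted model count of one rule.

    rule:  callable mapping a tuple of booleans to True or False.
    probs: probs[i] is the predicted probability that atom i is true.
    """
    total = 0.0
    for world in itertools.product([False, True], repeat=len(probs)):
        if rule(world):
            weight = 1.0
            for value, p in zip(world, probs):
                weight *= p if value else (1.0 - p)
            total += weight
    return total


def semantic_loss(rules, probs):
    """Sum of per-rule negative log weighted model counts (assumed combination)."""
    return -sum(math.log(wmc(rule, probs)) for rule in rules)


def implies(world):
    """Hypothetical rule: atom 0 implies atom 1."""
    return (not world[0]) or world[1]


probs = [0.9, 0.2]
rule_wmc = wmc(implies, probs)          # 1 - 0.9 * (1 - 0.2)
loss = semantic_loss([implies], probs)
```

The enumeration visits \(2^n\) assignments for a rule with \(n\) atoms, which shows why exact counting becomes expensive as rules grow.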

**Details of Task 1** The backbone for both the MNIST and FashionMNIST datasets is a three-layer Multilayer Perceptron (MLP) with layer widths [256, 512, 10] and Rectified Linear Unit (ReLU) activations (Maas et al., 2013). In contrast, the backbone for CIFAR10 is ResNet9 (He et al., 2016). The value of \(\lambda \) for both the Fuzzy and RILL logic losses is 0.7. For Semantic Loss, we select \(\lambda \) from \(\{0.001,0.005,0.01,0.05,0.1,0.5\}\) and choose the best value, which is 0.5 in this experiment. The learning rate is set to 0.0001, with a decay rate of 0.7. The learning rate scheduler is StepLR, with a decay step of 60. The optimizer is AdamW by default, and the weight decay rate is set to 5e-4.
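The learning rate schedule for Task 1 can be written out explicitly. The sketch below assumes that the "decay rate" is the multiplicative factor of a step schedule and that the "decay step" is counted in epochs; the function name `step_lr` is hypothetical, and the closed form mirrors the usual step-decay rule rather than any particular library call.

```python
def step_lr(epoch: int, base_lr: float = 1e-4, gamma: float = 0.7,
            step_size: int = 60) -> float:
    """Step decay: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at a few epochs under the Task 1 settings.
schedule = {epoch: step_lr(epoch) for epoch in (0, 59, 60, 120)}
```

Under these assumptions the learning rate stays at 1e-4 for the first 60 epochs, drops to 7e-5 at epoch 60, and to 4.9e-5 at epoch 120.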

**Details of Task 2** For this task, we select WideResNet-28-8 (Zagoruyko & Komodakis, 2016) as the backbone architecture with two classification heads, one for *class* classification, and the other for *super-class* classification. The value of \(\lambda \) for both Fuzzy and RILL logic loss is 0.002, and for Semantic Loss, we still choose 0.5 as the default value. The learning rate is set to 0.005, with a decay rate of 0.9 and decay steps equal to 45. The learning rate scheduler is set to StepWithWarmUp, and the warm-up epoch is set as 5. The optimizer used is set as default, and momentum is set to 0.9.

## Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

## About this article

### Cite this article

He, HY., Dai, WZ. & Li, M. Reduced implication-bias logic loss for neuro-symbolic learning.
*Mach Learn* **113**, 3357–3377 (2024). https://doi.org/10.1007/s10994-023-06436-4


DOI: https://doi.org/10.1007/s10994-023-06436-4