# Bayesian Artificial Intelligence

## 1 Introduction

Kevin Korb and Ann Nicholson are experienced researchers in Bayesian networks. They have contributed to the theoretical development of the field, and they have several application projects behind them. This is apparent in their textbook, Bayesian Artificial Intelligence. It is a well-written introduction to the field, and it contains many useful guidelines for building Bayesian network models. You cannot be successful in this field without good insight into the mathematical theory behind it, and the book provides a smooth and self-contained presentation of that theory.

In the preface, the authors state that the book is aimed at advanced undergraduates in computer science and those who wish to engage in pure research in Bayesian network technology. These are two kinds of readers. The first kind I shall call a *practitioner*. A practitioner is interested in learning sufficient material on the topic so as to be in a position to assist a domain expert in constructing a Bayesian network system. A *researcher*, on the other hand, is interested in an introduction to the theoretical foundations and the basic algorithms in the field of Bayesian networks.

A third possible kind of reader would be the *domain expert*. She may be an engineer, a physician, or a social scientist who wishes to learn about Bayesian networks in order to use them in her domain. The authors do not claim to have this kind of reader in mind, but I can recommend the book to such readers as well. For the first two kinds of readers, I would recommend two different tracks through the book. Although they are not indicated formally in the book, I shall comment separately on the tracks as I see them.

## 2 Practitioner’s track

This track consists of Chaps. 1, 2, 4, 5, 9, and 11. Following this track, the reader will get a very good introduction to Bayesian networks, *d*-separation and conditional independence, decisions, and utilities. The two most popular BN tools, Hugin and Netica, are described, and finally there is an introduction to general knowledge engineering problems with emphasis on Bayesian networks. Along the way, the reader will be introduced to a rich set of applications, and there is also a good supply of exercises.

The material is well written, the selection of material is good, and it does not assume more mathematical sophistication than each issue requires. In particular, I found Chap. 9 refreshing. In this chapter, the authors put the process of building Bayesian network systems into the framework of software engineering. To my knowledge, this has not been done systematically before, and I am sure that it will be of great help to anyone engaging in a non-toy application project. I would supplement this with a discussion of *maintenance*. I think that the main reason why most Bayesian network applications never leave the laboratory is that there is no good way of maintaining them “out there”. The worlds modelled by the networks will change, and the models have to be changed accordingly. If the only way to do this is to call for expert help and engage in an expensive new project, then the system will have no chance in a commercial setting. So far, there are no good solutions to the challenge of maintenance, but part of a solution would be *object-oriented Bayesian networks* (OOBNs). Unfortunately, the book does not have very much to say about OOBNs.

The practitioner’s track contains many procedures that help in the construction of Bayesian networks. For example, it presents a very nice computer system for illustrating *d*-separation. That said, from my experience in application projects, I miss hints on how to deal with problems such as the following: you have to model arithmetic, like the sum of the states of ten variables; there are logical constraints between variables; you wish to tune the parameters of the model to particular evidence–belief patterns; you would like to know how sensitive the model’s behaviour is to variation of the parameters.
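To make the first of these problems concrete, here is a minimal sketch of one familiar modelling device for sums: instead of giving the sum node all ten variables as parents, introduce a chain of partial-sum nodes. This is an illustration only and is not taken from the book; the function names, the restriction to independent binary variables, and the table-size count are all assumptions made for the example.

```python
from math import comb


def sum_distribution_via_chain(p_list):
    """Distribution of X_1 + ... + X_n for independent binary X_i.

    p_list[i] is P(X_i = 1). The loop mirrors a chain of partial-sum
    nodes S_k = S_{k-1} + X_k (hypothetical node names).
    """
    dist = [1.0]  # S_0 = 0 with probability 1
    for p in p_list:
        new = [0.0] * (len(dist) + 1)
        for s, ps in enumerate(dist):
            new[s] += ps * (1 - p)   # X_k = 0: partial sum unchanged
            new[s + 1] += ps * p     # X_k = 1: partial sum grows by one
        dist = new
    return dist


def cpt_sizes(n):
    """Table entries for a direct sum node versus the partial-sum chain."""
    # Direct: 2**n parent configurations times n + 1 states of the sum.
    direct = (2 ** n) * (n + 1)
    # Chain: S_1 is X_1 itself; S_k (k >= 2) has parents S_{k-1} with
    # k states and X_k with 2 states, and has k + 1 states of its own.
    chained = sum(k * 2 * (k + 1) for k in range(2, n + 1))
    return direct, chained


if __name__ == "__main__":
    print(cpt_sizes(10))
    print(sum_distribution_via_chain([0.5] * 10))
```

Under these assumptions the chain needs far fewer table entries than the direct construction, while the distribution of the final partial sum is the same.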

## 3 Researcher’s track

The researcher’s track consists of Chaps. 1, 2, 3, 6, 7, 8, and 10. After the basic definitions, it contains a presentation of message-passing algorithms for belief updating (Kim & Pearl, junction trees, sampling methods); it has three chapters on the learning of Bayesian networks, and it has a chapter on the evaluation of Bayesian networks. The chapter on belief updating provides a solid introduction to the classical methods. New developments have taken place since these methods were constructed more than ten years ago, but as these developments are improvements of the classical methods, it is reasonable to only present the classical methods here. Also, the chapter on evaluation provides good, solid information on various methods for evaluating a Bayesian network system.

The route through learning starts with discussions on learning systems with continuous variables, namely linear models. This is a confusing departure from the rest of the book. The concepts introduced in the book have not been defined for continuous variables, and it may be difficult for the reader to establish the connection from linear models to Bayesian networks.

The presentation of the learning algorithms consists of two interleaved topics: how they work, and how they can be interpreted as discovering causality. The presentation of how they work is good and fairly easy to follow. However, the interleaving with the second topic makes for hard reading. Furthermore, the quality of writing seems to decrease in these chapters. The algorithms presented in the book can be used regardless of how the results are interpreted, and I would prefer a presentation with the two topics separated. As it is now, a reader who is not prepared for the ongoing scientific debate on how to learn causality from non-experimental data gets confused and diverted. I expand on this below.

The algorithms for structural learning are *conditional independence learners* (CI and PC) and *scoring-based learners*, where various scoring metrics and search methods are presented. The algorithms for learning parameters are Bayesian methods over a conjugate family of distributions, and for missing values, Gibbs sampling as well as EM.
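As a minimal sketch of what Bayesian parameter learning over a conjugate family looks like for a single column of a conditional probability table, the following assumes a Dirichlet prior over a discrete variable with complete data. The function names and the example counts are hypothetical and are not taken from the book.

```python
def dirichlet_posterior(prior_counts, observed_counts):
    """Conjugate update: Dirichlet(alpha) prior plus multinomial counts
    gives a Dirichlet(alpha + n) posterior."""
    return [a + n for a, n in zip(prior_counts, observed_counts)]


def posterior_mean(alpha):
    """Expected probability of each state under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


if __name__ == "__main__":
    # Hypothetical three-state variable for one parent configuration:
    # a uniform prior and ten complete cases.
    alpha = dirichlet_posterior([1, 1, 1], [3, 0, 7])
    print(alpha)
    print(posterior_mean(alpha))
```

The update is carried out separately for each parent configuration, which is exactly where independence assumptions on the parameters come into play.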

I have a single objection with respect to the assumptions behind the methods for learning parameters. One of them is *parameter-independence across non-local states*. The authors write that this assumption is already guaranteed by the Markov property assumed as a basic practice for the Bayesian network as a whole. I do not agree. In Bayesian networks with several identical objects (e.g. dynamic Bayesian networks), parameter learning is not local. Actually, non-locality is exploited in learning algorithms for OOBNs.

## 4 Do the learning algorithms discover causality?

In the preface, the authors state their basic approach to the presentation of Bayesian networks, namely that the networks represent causal relations. I fully agree, but for many other researchers, particularly statisticians, Bayesian networks should rather be considered a compact representation of a joint distribution, and what matters are the conditional independence properties. This disagreement is highly important in connection with learning: I present the result of a learning algorithm to a domain expert. She says “Gee, this link from A to B is very interesting.” How shall I react?

Korb and Nicholson argue that the CI (and PC) algorithm actually discovers causality. They do not claim that all directed links should be interpreted as causal links, but a well-defined set of links qualify for this interpretation. In Pearl [1], there is a series of precisely specified assumptions and definitions leading to this conclusion. Korb and Nicholson do present parts of Pearl’s argument, but there are rather many holes, and the reader does not really have a chance to see the limitations of the algorithms with respect to discovering causality. For example, one of the assumptions in Pearl’s argument is that causality is deterministic, and uncertainty on causal impact is due to non-observed variables, which are abstracted away. It is questionable whether this assumption holds for social sciences.

I am not sure that a textbook like this should put so much effort into convincing the reader that the learning algorithms actually discover causality. In practice, they too often don’t. There is uncertainty about the discovered conditional independencies, and there may be latent variables. I agree that it is a substantial scientific achievement that there are algorithms which can detect causality from non-experimental data, but it is a mistake to let the reader believe that if he uses one of these algorithms on a data set, then, for example, the *v*-structures reflect causal dependence. Such a conclusion is much more subtle, and needs more background than can be given in an introductory textbook. A reader of this book may present a learned network to a physician or a social scientist, and they will, in good faith, find strong support for wrong causalities. This reservation of mine is even stronger with respect to the scoring-based algorithms. A discussion of how they discover causality should take place in the research literature, and a textbook should limit itself to a section referring to the discussion.
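For readers unfamiliar with the pattern at stake, the following minimal sketch shows the independence signature that conditional independence learners associate with a *v*-structure A → C ← B: A and B are independent, but become dependent given C. Everything here is an assumption made for illustration (the noisy-OR-like toy model, its parameters, and the function names), and the check runs on an exactly known joint distribution, which is precisely what a learner working from finite data with possible latent variables does not have.

```python
from itertools import product


def joint_collider(pa=0.5, pb=0.5, noise=0.1):
    """Joint P(A, B, C) for a toy model A -> C <- B.

    C equals (A OR B), flipped with probability `noise`.
    """
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        p_a = pa if a else 1 - pa
        p_b = pb if b else 1 - pb
        p_c = (1 - noise) if c == (a | b) else noise
        joint[(a, b, c)] = p_a * p_b * p_c
    return joint


def marginal(joint, keep):
    """Sum out every variable whose index is not in `keep`."""
    out = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out


def independent(joint, i, j, given=(), tol=1e-9):
    """Exact test of X_i independent of X_j given X_given.

    Uses P(i, j, z) * P(z) == P(i, z) * P(j, z) for all values.
    """
    given = tuple(given)
    p_ijz = marginal(joint, (i, j) + given)
    p_iz = marginal(joint, (i,) + given)
    p_jz = marginal(joint, (j,) + given)
    p_z = marginal(joint, given)
    for key, p in p_ijz.items():
        xi, xj, z = key[0], key[1], key[2:]
        if abs(p * p_z[z] - p_iz[(xi,) + z] * p_jz[(xj,) + z]) > tol:
            return False
    return True


if __name__ == "__main__":
    joint = joint_collider()
    print(independent(joint, 0, 1))              # A and B, marginally
    print(independent(joint, 0, 1, given=(2,)))  # A and B, given C
```

With real data, each of these independence statements is only the outcome of a statistical test, so the step from such a pattern to a causal reading carries the uncertainty described above.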

I do not wish to finish this review with critical points, as it may leave a wrong impression. The book targets advanced undergraduates in computer science and it hits them pretty precisely. It will serve perfectly as a textbook for a course, as well as for self-study.