next up previous
Next: Restricted Acyclic Partially Directed Up: Searching for Bayesian Network Previous: Introduction

DAGs and Equivalence Classes of DAGs

The search procedures used within Bayesian network learning algorithms usually operate on the space of DAGs. In this context, the problem can be formally expressed as: Given a complete training set (i.e. we do not consider missing values or latent variables) $D=\{{\bf u}^1,\ldots , {\bf u}^m\}$ of instances of a finite set of $n$ variables, ${\mathcal{U}}$, find the DAG $H^*$ such that

H^*=\arg\max_{H \in {\mathcal H}_n} g(H:D)
\end{displaymath} (1)

where $g(H:D)$ is a scoring function measuring the fitness of any candidate DAG $H$ to the dataset $D$, and ${\mathcal H}_n$ is the family of all the DAGs with $n$ nodes2.

Many of the search procedures, including the commonly used local search methods, rely on a neighborhood structure that defines the local rules (operators) used to move within the search space. The standard neighborhood in the space of DAGs uses the operators of arc addition, arc deletion and arc reversal, thereby avoiding (in the first and the third case) the inclusion of directed cycles in the graph.

The algorithms that search in the space of DAGs using local methods are efficient mainly because of the decomposability property that many scoring functions exhibit. A scoring function $g$ is said to be decomposable if the score of any Bayesian network structure may be expressed as a product (or a sum in the log-space) of local scores involving only one node and its parents:

$\displaystyle g(H:D) = \sum_{y\in {\mathcal{U}}} g_D(y,Pa_H(y))$     (2)
$\displaystyle g_D(y,Pa_H(y)) = g(y,Pa_H(y):N_{y,Pa_H(y)})$     (3)

where $N_{y,Pa_H(y)}$ are the statistics of the variables $y$ and $Pa_H(y)$ in $D$, i.e. the number of instances in $D$ that match each possible instantiation of $y$ and $Pa_H(y)$. $Pa_H(y)$ will denote the parent set of $y$ in the DAG $H$, i.e. $Pa_H(y)=\{t\in {\cal U}  \vert  t \rightarrow y \in H\}$.

A procedure that changes one arc at each move can efficiently evaluate the improvement obtained by this change. Such a procedure can reuse the computations carried out at previous stages, and only the statistics corresponding to the variables whose parent sets have been modified need to be recomputed. The addition or deletion of an arc $x\rightarrow y$ in a DAG $H$ can therefore be evaluated by computing only one new local score, $g_D(y,Pa_{H\cup\{x\}}(y))$ or $g_D(y,Pa_{H\setminus\{x\}}(y))$, respectively. The evaluation of the reversal of an arc $x\rightarrow y$ requires the computation of two new local scores, $g_D(y,Pa_{H\setminus\{x\}}(y))$ and $g_D(x,Pa_{H\cup\{y\}}(x))$.

It should be noted that each structure in the DAG space is not always different from the others in terms of its representation capability: if we interpret the arcs in a DAG as causal interactions between variables, then each DAG represents a different model; however, if we see a DAG as a set of dependence/independence relationships between variables (that permits us to factorize a joint probability distribution), then different DAGs may represent the same model. Even in the case of using a causal interpretation, if we use observation-only data (as opposed to experimental data where some variables may be manipulated), it is quite common for two Bayesian networks to be empirically indistinguishable. When two DAGs $H$ and $H'$ can represent the same set of conditional independence assertions, we say that these DAGs are equivalent3 [59], and we denote this as $H\simeq H'$.

When learning Bayesian networks from data using scoring functions, two different (but equivalent) DAGs may be indistinguishable, due to the existence of invariant properties on equivalent DAGs, yielding equal scores. We could take advantage of this in order to get a more reduced space of structures to be explored.

The following theorem provides a graphical criterion for determining the equivalence of two DAGs:

Theorem 1 [59]   Two DAGs are equivalent if and only if they have the same skeleton and the same v-structures.

The skeleton of a DAG is the undirected graph that results from ignoring the directionality of every edge. A v-structure in a DAG $H$ is an ordered triplet of nodes, $(x,y,z)$, such that (1) $H$ contains the arcs $x\rightarrow y$ and $y \leftarrow z$, and (2) the nodes $x$ and $z$ are not adjacent in $H$. A head-to-head pattern (shortened h-h) in a DAG $H$ is an ordered triplet of nodes, $(x,y,z)$, such that $H$ contains the arcs $x\rightarrow y$ and $y \leftarrow z$. Note that in an h-h pattern $(x,y,z)$ the nodes $x$ and $z$ can be adjacent.

Another characterization of equivalent DAGs was presented by [15], together with proof that several scoring functions used for learning Bayesian networks from data give the same score to equivalent structures (such functions are called score equivalent functions).

The concept of equivalence of DAGs partitions the space of DAGs into a set of equivalence classes. Whenever a score equivalent function is used, it seems natural to search for the best configuration in this new space of equivalence classes of DAGs. This change in the search space may bring several advantages:

The disadvantages are that, in this space of equivalence classes, it is more expensive to generate neighboring configurations, because we may be forced to perform some kind of consistency check, in order to ensure that these configurations represent equivalence classes4; in addition, the evaluation of the neighboring configurations may also be more expensive if we are not able to take advantage of the decomposability property of the scoring function. Finally, the new search space might introduce new local maxima that are not present in DAG space.

In order to design an exploring process for the space of equivalence classes we could use two distinct approaches: the first consists in considering that an equivalence class is represented by any of its components (in this case, it is necessary to avoid evaluating more than one component per class); and the second consists in using a canonical representation scheme for the classes.

The graphical objects commonly used to represent equivalence classes of DAGs are acyclic partially directed graphs [59] (known as PDAGs). These graphs contain both directed (arcs) and undirected (links) edges, but no directed cycles. Given a PDAG $G$ defined on a finite set of nodes ${\cal U}$ and a node $y\in {\cal U}$, the following subsets of nodes are defined:

An arc $x\rightarrow y$ in a DAG $H$ is compelled if it appears in every DAG belonging to the same equivalence class as $H$. An arc $x\rightarrow y$ in $H$ is said to be reversible if it is not compelled, i.e. there is a DAG $H'$ equivalent to $H$ that contains the arc $x\leftarrow y$. As every DAG in a particular equivalence class has the same set of compelled and reversible arcs, a canonical representation of an equivalence class is the PDAG consisting of an arc for every compelled arc in the equivalence class, and a link for every reversible arc. This kind of representation has been given several names: pattern [67], completed PDAG (CPDAG) [16], essential graph [4,21]. As a consequence of theorem 1, a completed PDAG possesses an arc $x\rightarrow y$ if and only if a triplet of nodes $(x,y,z)$ forms a v-structure or the arc $x\rightarrow y$ is required to be directed due to other v-structures (to avoid forming a new v-structure or creating a directed cycle) (see Figure 1).

Figure 1: (a) Dag and (b) completed PDAG. The arcs $z\rightarrow x$, $y\rightarrow x$ and $x\rightarrow u$ are compelled; the arc $y\rightarrow w$ is reversible

Note that an arbitrary PDAG does not necessarily represent some equivalence class of DAGs, although there is a one-to-one correspondence between completed PDAGs and equivalence classes of DAGs. Nevertheless, completed PDAGs are considerably more complicated than general PDAGs. A characterization of the specific properties that a PDAG must verify in order to be a completed PDAG was obtained by [4]:

Theorem 2 [4]   A PDAG $G$ is a completed PDAG if and only if it satisfies the following conditions:
$G$ is a chain graph, i.e. it contains no (partially) directed cycles.
The subgraph induced by every chain component5 of $G$ is chordal (i.e. on every undirected cycle of length greater than or equal to 4 there are two non-consecutive nodes connected by a link).
The configuration $x\rightarrow y\mbox{-}z$ does not occur as an induced subgraph of $G$.
Every arc $x\rightarrow y\in G$ occurs in at least one of the four configurations displayed in Figure 2 as an induced subgraph of $G$.

Figure 2: The four different configurations containing an arc $x\rightarrow y$ in a completed PDAG

Let us illustrate the advantages of searching in the space of equivalence classes of DAGs rather than the space of DAGs with a simple example. Figure 3 displays the set of possible DAGs involving three nodes $\{x,y,z\}$, with arcs between $z$ and $x$, and between $y$ and $x$. The first three DAGs are equivalent. In terms of independence information, they lead to the same independence statement $I(y,z\vert x)$ ($y$ and $z$ are conditionally independent given $x$), whereas the statement $I(y,z\vert\emptyset)$ ($y$ and $z$ are marginally independent) corresponds to the fourth one. The four DAGs may be summarized by only two different completed PDAGs, shown in Figure 4.

Figure 3: Four different DAGs with three nodes and two arcs

Figure 4: Two different equivalence classes of DAGs

As we can see, the search space may be reduced by using PDAGs to represent the classes: in our example, to two classes instead of four configurations; it can be seen [4] that the ratio of the number of DAGs to the number of classes is $25 / 11$ for three nodes, $543 / 185$ for four nodes and $29281 / 8792$ for five nodes; in more general terms, the results obtained by [39] indicate that this ratio approaches a value of less than four as the number of nodes increases. The use of equivalence classes therefore entails convenient savings in exploration and evaluation effort, although the gain is not spectacular.

On the other hand, the use of a canonical representation scheme allows us to explore the space progressively and systematically, without losing any unexplored configuration unnecessarily. Returning to our example, let us suppose that the true model is the DAG displayed in Figure 4.b and we start the search with an empty graph (with no arcs). Let us also assume that the search algorithm identifies that an edge between $x$ and $y$ produces the greatest improvement in the score. At this moment, the two alternatives, $x\rightarrow y$ and $x\leftarrow y$ (case 1 and case 2 in Figure 5, respectively), are equivalent. Let us now suppose that we decide to connect the nodes $x$ and $z$; again we have two options: $x\rightarrow z$ or $x\leftarrow z$. Nevertheless, depending on the previous selected configuration, we obtain different outcomes that are no longer equivalent (see Figure 5).

If we had chosen case 1 (thus obtaining either case 1.1 or case 1.2), we would have eliminated the possibility of exploring the DAG $z\rightarrow x\leftarrow y$, and therefore the exploring process would have been trimmed. As the true model is precisely this DAG (case 2.1 in Figure 5), then the search process would have to include another arc connecting $y$ and $z$ (cases 1.1.1, 1.2.1 or 1.2.2), because $y$ and $z$ are conditionally dependent given $x$. At this moment, any local search process would stop (in a local optimum), because every local change (arc reversal or arc removal) would make the score worse.

Figure 5: Local search in the space of DAGs is trapped at a local optimum

Consequently, our purpose consists in adding or removing edges (either links or arcs) to the structure without pruning the search space unnecessarily. We could therefore introduce links instead of arcs (when there is not enough information to distinguish between different patterns of arcs), which would serve as templates or dynamic linkers to equivalence patterns. They represent any valid combination of arcs which results in a DAG belonging to the same equivalence class.

Looking again at the previous example, we would proceed as follows: assuming that in our search space the operators of link addition and creation of h-h patterns are available, we would first include the link $x\mbox{-}y$; secondly, when considering the inclusion of a connection between $x$ and $z$, we would have two options, shown in Figure 4: the h-h pattern $z\rightarrow x\leftarrow y$ and the pattern $z \mbox{-}x \mbox{-}y$. In this case the scoring function would assign the greatest score to the h-h pattern $z\rightarrow x\leftarrow y$, thus obtaining the correct DAG.

next up previous
Next: Restricted Acyclic Partially Directed Up: Searching for Bayesian Network Previous: Introduction
Luis Miguel de Campos Ibáñez 2003-05-30