
 
% Sample LaTeX file for creating a paper in the Morgan Kaufmann two 
% column, 8 1/2 by 11 inch proceedings format. 

% notation summary, put here for internal consistency:
% \vec{X}: all variables in the domain
% \vec{S}: a subset of variables in the domain
% \vec{x}: a particular assignment to all variables in the domain
% \vec{s}: particular assignment to a subset
% i: index over variables or sets of variables
% j: index over rows
% N: the number of variables
% R: the number of rows
% k: index over center numbers
% M: the number of centers

 
\documentstyle[proceed,epsf,psfig]{article} 
 
\title{Mix-nets: Factored Mixtures of Gaussians in Bayesian Networks with
Mixed Discrete and Continuous Variables} 
 
\author{ {\bf Scott Davies} \\  
School of Computer Science \\  
Carnegie Mellon University\\ 
Pittsburgh, PA 15213 \\ 
scottd@cs.cmu.edu \\
\And 
{\bf Andrew Moore} \\  
School of Computer Science \\  
Carnegie Mellon University\\ 
Pittsburgh, PA 15213 \\ 
awm@cs.cmu.edu \\
} 
 
\begin{document} 
 
\maketitle 
 
\begin{abstract} 

Recently developed techniques have made it possible to quickly learn
accurate probability density functions from data in low-dimensional
continuous spaces.  In particular, mixtures of Gaussians can be fitted
to data very quickly using an accelerated EM algorithm that employs
multiresolution $k$d-trees~\cite{awm:vfembmmc}.  However, these
techniques do not scale to large numbers of variables.  In this paper,
we propose a new kind of Bayesian network in which low-dimensional
mixtures of Gaussians over different subsets of the domain's variables
are combined into a coherent joint probability model over the
entire domain.  The network is also capable of modelling complex
dependencies between discrete variables and continuous variables
without requiring discretization of the continuous variables.  We
present efficient heuristic algorithms for automatically learning
these networks from data, and perform comparative experiments
illustrating how well these networks model real scientific data.  We
also briefly discuss numerous possible improvements to the networks,
as well as their possible application to anomaly detection,
classification, probabilistic inference, and compression.

\end{abstract} 

\newcommand{\onefig}[3]{\begin{figure}[htb]
\begin{center}
\begin{minipage}{0.35\textwidth}
\centerline{\psfig{file=#1,width=\textwidth}}
\end{minipage}
\begin{minipage}{0.35\textwidth}
{\footnotesize
\refstepcounter{figure}
\label{#2}
Figure~\ref{#2}: #3}
\end{minipage}
\end{center}
\end{figure}}
 
\section{Introduction}

% Motivate pdf estimation.  Mention existing mixtures-of-gaussians-type
% stuff.  Mention why they break.  (Figure of evil bimodal independent
% variables example?)  Talk about existing methods for handling
% continuous variables in Bayes nets, and why none of them both scale and
% work well.  Then there's the other problem of discrete variables.
% Uh oh!  But here we come, to the rescue...

%Note: must mention Driver \& Morrell 95!

Bayesian networks (otherwise known as belief networks) are a popular
method of representing joint probability distributions over many
variables~\cite{pearl:priisnpi}.  A Bayesian network contains a
directed acyclic graph $G$ with one vertex $V_i$ in the graph for each
variable $X_i$ in the domain.  The directed edges in the graph specify
a set of independence relationships between the variables.  Define
$\Pi_i$ to be the set of variables whose nodes in the graph are
``parents'' of $V_i$.  The set of independence relationships specified
by a given graph is then as follows: given the values of $\Pi_i$ but no
other information, $X_i$ is conditionally independent of all variables
corresponding to nodes that are not $V_i$'s descendants in the graph.
This set of independence relationships allows us to decompose the
joint probability distribution $P(\vec{X})$ in the following manner:
\[P(\vec{X}) = \prod_{i=1}^{N} {P(X_i | \Pi_i)},\]
where $N$ is the number of variables in the domain.  Thus, if in
addition to $G$ we also specify $P(X_i | \Pi_i)$ for every variable
$X_i$, then we have specified a valid probability distribution
$P(\vec{X})$ over the entire domain.
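To make the factorization concrete, here is a minimal Python sketch for a hypothetical three-variable chain $A \rightarrow B \rightarrow C$ over binary variables; the conditional probability tables are made up for illustration:

```python
# Hypothetical chain A -> B -> C over binary variables.  Each node
# stores P(X_i | Pi_i), as in the factorization P(X) = prod_i P(X_i | Pi_i).
p_a = {0: 0.6, 1: 0.4}                              # P(A), no parents
p_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}    # P(B | A)
p_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}    # P(C | B)

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the chain factorization."""
    return p_a[a] * p_b[a][b] * p_c[b][c]

# Because each conditional sums to one, the factored terms define a
# valid distribution: the eight joint probabilities sum to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```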

Bayesian networks are most commonly used in situations where all the
variables are discrete, largely because it is difficult to model
complex probability densities over continuous variables, and even
harder to model interactions between continuous and discrete
variables.  When Bayesian networks with continuous variables are used,
the continuous variables are usually modelled with very simple
parametric forms such as multidimensional Gaussians.  Some researchers
have recently investigated the use of more complicated continuous
distributions within Bayesian networks; for example, weighted sums of
Gaussians have been used to approximate conditional probability
density functions~\cite{drivmorr:iocbnusowg}.  Unfortunately, such
complex distributions over continuous variables are usually quite
computationally expensive to learn.  If an appropriate Bayesian
network structure is known beforehand, then this expense may not be too
problematic, since only $N$ conditional distributions must be learned.
On the other hand, if the
dependencies in the data are not known {\em a priori} and the
structure must be learned from the data, then the number of
conditional distributions that must be learned and tested while a
structure-learning algorithm searches for a good network can become
unmanageably large.

However, very fast algorithms for generating complex joint probability
densities over small sets of continuous variables have recently been
developed~\cite{awm:vfembmmc}.  This paper investigates how these
algorithms can be employed to learn Bayesian networks over many
variables, each of which can be either continuous or discrete.  In
section~\ref{mixnetsect}, we describe the type of parameterizations
employed in our networks' nodes, and how they are learned from data
given a fixed Bayesian network structure.  In
section~\ref{structlearnsect}, we describe an algorithm for
automatically learning the structures of our Bayesian networks from
data.  In section~\ref{experimentsect}, we provide experimental
results illustrating the effectiveness of our methods on two real
scientific datasets.  In section~\ref{appsect} we discuss possible
applications, and finally in section~\ref{conclsect} we discuss a few
possible lines of research for improvements.

\section{Mix-nets}
\label{mixnetsect}

\subsection{General methodology}

Suppose that we have a very fast, black-box algorithm $A$ geared
not towards finding accurate conditional models of the form $P_i(X_i |
\Pi_i)$, but rather towards finding accurate {\em joint} probability
models $P_i(\vec{S_i})$ over subsets of variables $\vec{S_i}$, such as
$P_i(X_i, \Pi_i)$.  Furthermore, suppose it is only capable of generating
joint models for relatively small subsets of the variables, and that the
models returned for different subsets of variables are not necessarily
consistent.  For example, if we were to ask $A$ for two different
models $P_1(X_1, X_{17})$ and $P_2(X_1, X_{24})$, the marginal
distributions $P_1(X_1)$ and $P_2(X_1)$ of these models may be
inconsistent.  Can we still combine many different models generated
by $A$ into a valid probability distribution over the entire space?

Fortunately, the answer is yes, as long as the models returned by $A$
can be marginalized exactly.  If for any given $P_i(X_i,
\Pi_i)$ we can compute a marginal distribution $P_i(\Pi_i)$ that is
consistent with it, then we can employ $P_i$ as a conditional distribution
$P_i(X_i | \Pi_i)$ trivially as follows:
\[P_i(X_i | \Pi_i) = P_i(X_i, \Pi_i) / P_i(\Pi_i). \]
In this case, given a directed acyclic graph $G$ specifying a Bayesian
network structure over $N$ continuous variables, we can simply use $A$ to acquire $N$ models $P(X_i,
\Pi_i)$, marginalize these models, and string them together to form a
probability distribution over the entire space:
\[P(\vec{X}) = \prod_{i=1}^N {P_i(X_i, \Pi_i) / P_i(\Pi_i)} \]
A simple but key observation is that even though the marginals of
different $P_i$'s may be inconsistent with each other, the $P_i$'s are
only {\em used} conditionally, and in a manner that prevents these
inconsistencies from actually causing the overall model to become
inconsistent.  Of course, the fact that there are inconsistencies at
all --- suppressed or not --- means that there are redundant
parameters in the overall model.  However, if allowing such redundancy
lets us employ a particularly fast and effective model-learning
algorithm $A$, it may be worth it.
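This observation can be illustrated with a toy sketch: two joint tables over binary variables whose $X_1$-marginals deliberately disagree still combine into a valid distribution when the second model is used only conditionally. All numerical values here are hypothetical:

```python
# Model 1 covers X1 alone; model 2 covers (X2, X1).  Their X1-marginals
# are deliberately inconsistent (0.5 vs. 0.4), as in the text's example.
p1 = {0: 0.5, 1: 0.5}                                      # P_1(X1)
p2 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}  # P_2(X2, X1)

def p2_marginal_x1(x1):
    """Exact marginal of model 2 over X1."""
    return sum(p for (x2, v), p in p2.items() if v == x1)

def joint(x1, x2):
    """P(X1, X2) = P_1(X1) * P_2(X2 | X1): P_2 appears only conditionally,
    so its inconsistent X1-marginal never contaminates the overall model."""
    return p1[x1] * p2[(x2, x1)] / p2_marginal_x1(x1)

total = sum(joint(x1, x2) for x1 in (0, 1) for x2 in (0, 1))
```

Summing the conditional $P_2(X_2 \mid X_1)$ over $X_2$ gives one for each $x_1$, so the product sums to one even though $P_2$'s marginal over $X_1$ (0.4 for $X_1 = 0$) disagrees with $P_1$'s (0.5).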

\subsection{Handling Continuous Variables}

Suppose for the moment that $\vec{X}$ contains only continuous values.
What sorts of models might we want $A$ to return?  One powerful type
of model for representing probability density functions over small
sets of variables is a {\em Gaussian mixture model}
(see e.g. \cite{dudahart:pcasa}).  Let $\vec{s_j}$ represent the values
that the $j^{th}$ datapoint in the dataset $D$ assigns to the variable set of
interest $\vec{S_i}$.  In a Gaussian mixture model over $\vec{S_i}$, we assume that
the data are generated independently through the following process:
for each $\vec{s_j}$ in turn, nature begins by randomly picking a class,
$c_k$, from a discrete set of classes $\{c_1, \ldots, c_M\}$.  Then
nature draws $\vec{s_j}$ from a multidimensional Gaussian whose mean $\mu_k$
and covariance $\Sigma_k$ depend on the class.  This produces a
distribution of the following mathematical form:

{\small \[P(\vec{S_i} | \vec{\theta}) = \sum_{k=1}^{M} \alpha_k \, ((2\pi)^{d} |\Sigma_k|)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\vec{S_i} - \vec{\mu_k})^{T} \Sigma_{k}^{-1} (\vec{S_i} - \vec{\mu_k})\right) \]}

where $d$ is the dimensionality of $\vec{S_i}$, $\alpha_k$ represents
the probability of a point coming from the $k^{th}$ class, and 
\[\vec{\theta} = \{\alpha_1, \ldots, \alpha_M, \mu_1, \ldots, \mu_M, \Sigma_1, \ldots, \Sigma_M \} \]
denotes the entire set of the mixture's parameters.  It is possible to
model any continuous probability distribution with arbitrary accuracy
by using a Gaussian mixture with a sufficiently large $M$.

Given a Gaussian mixture model $P_i(X_i, \Pi_i)$, it is easy to
compute the marginalization $P_i(\Pi_i)$: the marginal mixture has the
same number of Gaussians as the original mixture, with the same
$\alpha$'s.  The means and covariances of the marginal mixture are
simply the means and covariances of the original mixture with all
elements corresponding to the variable $X_i$ removed.  
%The inverses
%and determinants of the new marginal's covariance matrices must then
%be recomputed.  Note that these computations need only be done
%once --- we do not need to reevaluate the inverse or the determinant
%every time we wish to use the marginal distribution.
Thus, Gaussian mixture models are suitable for combining into global
joint probability density functions using the methodology described
above, assuming all variables in the domain are continuous.  This is
the class of models we employ in this paper, although many other
classes may be used in an analogous fashion.  
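This marginalization step can be sketched directly, assuming the mixture is stored as plain arrays (NumPy is used here for the matrix slicing; the example mixture is hypothetical):

```python
import numpy as np

def marginalize_mixture(alphas, means, covs, drop):
    """Marginalize a Gaussian mixture by removing one variable's index.

    As described in the text: the marginal mixture keeps the same number
    of components and the same alphas; each mean and covariance simply
    loses the row/column corresponding to the dropped variable.
    """
    keep = [i for i in range(means[0].shape[0]) if i != drop]
    new_means = [m[keep] for m in means]
    new_covs = [c[np.ix_(keep, keep)] for c in covs]
    return alphas, new_means, new_covs

# Hypothetical 2-component mixture over (X0, X1, X2); marginalize out X1.
alphas = [0.3, 0.7]
means = [np.array([0.0, 1.0, 2.0]), np.array([5.0, 6.0, 7.0])]
covs = [np.eye(3), 2.0 * np.eye(3)]
a, m, c = marginalize_mixture(alphas, means, covs, drop=1)
```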

\subsubsection{Learning Gaussian mixtures from data}

The EM algorithm is a popular method for learning mixture
models from data.  The basic EM algorithm for Gaussian mixture models
begins with some fixed number of Gaussians $M$ and some arbitrary
initial values for the parameters in the mixture.  It then iterates
the following two steps repeatedly:
\begin {itemize}
\item Given the current network parameters $\vec{\theta}$, for each datapoint $\vec{s_j}$ and each class $c_k$, calculate 
the extent $w_{jk}$ to which class $c_k$ ``owns'' $\vec{s_j}$: $w_{jk} = P(c_k | \vec{s_j}, \vec{\theta})$.
\item Adjust $\vec{\theta}$ as follows:
\[\alpha_k = \frac{sw_k}{R}, \quad \mu_k = \frac{1}{sw_k}\sum_{j=1}^{R} w_{jk}\vec{s_j},\] 
\[\Sigma_k = \frac{1}{sw_k}\sum_{j=1}^{R} w_{jk}(\vec{s_j}-\mu_k)(\vec{s_j}-\mu_k)^T\]
where $sw_k = \sum_{j=1}^{R} w_{jk}$.
\end {itemize}
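The two steps above can be sketched as follows for a one-dimensional mixture. This is the basic, unaccelerated update, not the $k$d-tree-accelerated version used in the paper, and the test data are synthetic:

```python
import numpy as np

def em_step(x, alphas, mus, sigmas2):
    """One E+M iteration of the basic EM update for a 1-D Gaussian
    mixture.  x has shape (R,); there are M components."""
    R, M = len(x), len(alphas)
    # E-step: w[j, k] = P(c_k | s_j, theta), the "ownership" weights.
    dens = np.array([alphas[k]
                     * np.exp(-0.5 * (x - mus[k]) ** 2 / sigmas2[k])
                     / np.sqrt(2 * np.pi * sigmas2[k])
                     for k in range(M)]).T            # shape (R, M)
    w = dens / dens.sum(axis=1, keepdims=True)
    # M-step: sw_k = sum_j w_jk, then the three updates from the text.
    sw = w.sum(axis=0)
    new_alphas = sw / R
    new_mus = (w * x[:, None]).sum(axis=0) / sw
    new_sigmas2 = (w * (x[:, None] - new_mus) ** 2).sum(axis=0) / sw
    return new_alphas, new_mus, new_sigmas2

# Synthetic bimodal data; EM recovers the two modes near -3 and +3.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
a, mu, s2 = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    a, mu, s2 = em_step(x, a, mu, s2)
```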

Each iteration of this algorithm increases the overall likelihood of
the data.  
%(See (ref?) for details.)
Unfortunately, each iteration of
this basic algorithm is slow, since it requires an entire pass through
the data.  Instead, we use an accelerated EM algorithm in which
multiresolution $k$d-trees~\cite{awm:elwpr} are used to dramatically
reduce the computational cost of each iteration~\cite{awm:vfembmmc}.
We refer the interested reader to this previous paper~\cite{awm:vfembmmc}
for details.

An important remaining issue is how to choose the appropriate number of Gaussians, $M$, for the mixture.  If we restrict ourselves to too few
Gaussians, we will fail to model the data accurately; on the other hand,
if we allow ourselves too many, then we will ``overfit'' the data
and our model will generalize poorly.  A popular way of dealing with
this tradeoff is to choose the model maximizing a ``scoring function''
that includes penalty terms related to the number of parameters in
the model.  We employ the Bayesian Information Criterion, or BIC,
to choose between mixtures with different numbers of Gaussians.
The BIC score for a given probability model $P'(\vec{S})$ is as follows:

\[BIC(P') = \log P'(D_S) - \frac{\log(R)}{2} |P'|\]

where $D_S$ is the dataset $D$ restricted to the variables of interest
$\vec{S}$, and $|P'|$ is the number of parameters in $P'$.  
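A small sketch of the BIC computation follows. The parameter-counting helper reflects one common convention for Gaussian mixtures ($M-1$ free mixing weights, $Md$ means, $Md(d+1)/2$ free covariance entries); the paper does not spell out its exact count, so this convention is an assumption:

```python
import math

def bic_score(loglik, n_params, n_rows):
    """BIC(P') = log P'(D_S) - (log R / 2) * |P'|, as in the text."""
    return loglik - 0.5 * math.log(n_rows) * n_params

def gmm_param_count(n_components, dim):
    """One common parameter count for an M-component, d-dimensional
    Gaussian mixture (an assumed convention, not taken from the paper):
    M-1 free weights + M*d means + M*d*(d+1)/2 covariance entries."""
    return (n_components - 1) + n_components * dim \
        + n_components * dim * (dim + 1) // 2
```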

Rather than re-run the EM algorithm to convergence for many different
choices of $M$ and choosing the resulting mixture that maximizes the
BIC score, we use a heuristic algorithm that starts with a small
number of Gaussians and stochastically tries adding or deleting
Gaussians.  Gaussians with high overall probabilities are sometimes
each split into two Gaussians, and Gaussians with low overall
probabilities are sometimes eliminated.  After the number of Gaussians
is changed in this fashion, the EM algorithm is run for a few more
iterations.  If the resulting mixture has a higher BIC score than 
the BIC score of the mixture with the previous number of Gaussians,
then the algorithm continues; otherwise it resets its distribution
back to the mixture with the previous number of Gaussians, runs 
the EM algorithm for a few more iterations, and then continues stochastically
from there.  (Limited space precludes us from going into the gory details.)

\subsection{Handling Discrete Variables}

Suppose now that the set of variables $\vec{S_i}$ we wish to model
includes discrete variables as well as continuous variables.  Let
$\vec{Q_i}$ be the discrete variables in $\vec{S_i}$, and $\vec{C_i}$ the
continuous variables in $\vec{S_i}$.  One simple model for $P_i(\vec{Q_i},
\vec{C_i})$ is a lookup table with an entry for each possible set of
assignments to $\vec{Q_i}$.  The entry in the table corresponding to a
particular set of assignments to $\vec{Q_i}$ contains two things: the
marginal distribution $P_i(\vec{Q_i})$, and a Gaussian mixture modelling
the conditional distribution $P_i(\vec{C_i} | \vec{Q_i})$.  Let us refer to
tables of this form as {\em mixture tables}.  We obtain each of the
mixture table's estimates for $P_i(\vec{Q_i})$ directly from the data: it
is simply the fraction of the rows in the dataset that assign the
appropriate values to the variables in $\vec{Q_i}$.  Given an algorithm
$A$ for learning Gaussian mixtures from continuous data, we use it to
generate the conditional distributions $P_i(\vec{C_i} | \vec{Q_i})$ in the
mixture table by calling it once for every possible set of assignments
to $\vec{Q_i}$; each time, we only provide $A$ with the subset of the
dataset $D$ that is consistent with those assignments.
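A sketch of mixture-table construction: one entry per joint assignment to the discrete variables, holding the count-based estimate of $P_i(\vec{Q_i})$ and a conditional continuous model fit to the matching rows. Here `fit_mixture` is a hypothetical stand-in for the accelerated EM learner (in the example it merely counts rows):

```python
from collections import defaultdict

def build_mixture_table(rows, discrete_cols, fit_mixture):
    """Build a mixture table: for each assignment q to the discrete
    variables, store P(Q=q) (a fraction of the rows) and a model of
    P(C | Q=q) fit only to the rows consistent with q."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in discrete_cols)
        groups[key].append(row)
    table = {}
    for key, subset in groups.items():
        table[key] = {
            "p_q": len(subset) / len(rows),   # P_i(Q_i) from counts
            "mixture": fit_mixture(subset),   # stand-in for P_i(C_i | Q_i)
        }
    return table

rows = [{"q": 0, "x": 1.0}, {"q": 0, "x": 2.0}, {"q": 1, "x": 5.0}]
table = build_mixture_table(rows, ["q"], fit_mixture=len)
```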

Suppose now that we are given a Bayesian network structure over the
entire set of variables, and for each variable $X_i$ we are given a
mixture table for $P_i(\vec{S_i}) = P_i(X_i, \Pi_i)$.  We must now calculate new
mixture tables for each of the marginal distributions $P_i(\Pi_i)$ so
that we can use them for the conditional distributions $P_i(X_i | \Pi_i)
= P_i(X_i, \Pi_i) / P_i(\Pi_i)$.  Let $\vec{C}$ represent the continuous variables
in $X_i \cup \Pi_i$; $\vec{Q}$ represent the discrete variables in
$X_i \cup \Pi_i$; $\vec{C_{\Pi_i}}$ represent the continuous variables
in $\Pi_i$; and $\vec{Q_{\Pi_i}}$ represent the discrete variables in
$\Pi_i$.  (Either $\vec{Q_{\Pi_i}} = \vec{Q_i}$ or $\vec{C_{\Pi_i}} =
\vec{C_i}$, depending on whether $X_i$ is continuous or discrete.)

If $X_i$ is continuous, then the marginalized mixture table for
$P(\Pi_i)$ has the same number of entries as the original table for
$P_i(X_i, \Pi_i)$, and the estimates for $\vec{Q_i}$ in
the marginalized table are the same as in the original table.  For
each combination of assignments to $\vec{Q_i}$, we simply marginalize
the appropriate Gaussian mixture in the original table, $P(\vec{C_i} |
\vec{Q_i})$, to a new mixture $P(\vec{C_{\Pi_i}} | \vec{Q_i})$, and use
this new mixture in the corresponding spot in the marginalized table.

If $X_i$ is discrete, then for each combination of assignments
to $\vec{Q_{\Pi_i}}$, we combine several different
Gaussian mixtures for various $P_i(\vec{C_{\Pi_i}} | \vec{Q_i})$'s 
into a new Gaussian mixture for $P_i(\vec{C_{\Pi_i}} | \vec{Q_{\Pi_i}})$.
We do this using the following relationship:
\[P_i(\vec{C_{\Pi_i}} | \vec{Q_{\Pi_i}}) = \sum_{X_i}
P_i(X_i | \vec{Q_{\Pi_i}}) P_i(\vec{C_{\Pi_i}} | \vec{Q_i}) \]
The values of $P(\vec{Q_{\Pi_i}})$ in the marginalized table are computed
trivially from the original table as $P_i(\vec{Q_{\Pi_i}}) = \sum_{X_i}
P_i(X_i, \vec{Q_{\Pi_i}})$.
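Since each term in the sum is itself a Gaussian mixture, the marginal is a single larger mixture whose components are the children's components with rescaled $\alpha$'s. A sketch, with means and covariances left opaque:

```python
def combine_mixtures(weights, mixtures):
    """Form sum_x P(x | Q_Pi) * P(C | Q_Pi, x) as one Gaussian mixture:
    concatenate the component lists, rescaling each child mixture's
    alphas by its weight P(x | Q_Pi).  Each mixture is a list of
    (alpha, mean, cov) triples; means/covs are placeholders here."""
    combined = []
    for w, mix in zip(weights, mixtures):
        combined.extend((w * alpha, mean, cov) for alpha, mean, cov in mix)
    return combined

mix0 = [(0.5, "m0a", "c0a"), (0.5, "m0b", "c0b")]   # P(C | Q_Pi, X_i=0)
mix1 = [(1.0, "m1a", "c1a")]                        # P(C | Q_Pi, X_i=1)
out = combine_mixtures([0.25, 0.75], [mix0, mix1])  # weights P(X_i | Q_Pi)
```

The combined $\alpha$'s still sum to one because each child mixture's $\alpha$'s summed to one and the weights $P(X_i \mid \vec{Q_{\Pi_i}})$ do as well.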

We have now described the steps necessary to use mixture tables in
order to parameterize Bayesian networks over domains with both
discrete and continuous variables.  Note that mixture tables are not
particularly well-suited for dealing with discrete variables that can
take on many possible values, or for situations involving many
dependent discrete variables --- in such situations, the data will be
shattered into many separate Gaussian mixtures, each of which has little
support from the data.  Better ways of dealing with discrete variables
are undoubtedly possible, but we leave them for future research.
(We will briefly discuss how we currently handle mixture tables' problems with
sparse data in our experimental results section.)

\section{Learning mix-nets' structure}
\label{structlearnsect}

Given a Bayesian network structure over a domain with both discrete
and continuous variables, we now know how to learn mixture tables
describing each variable's joint probability function with its parents
in the network, and how to marginalize these mixture tables to obtain
the conditional distribution needed to compute a coherent probability
function over the entire domain.  But what if we don't know {\it a
priori} what dependencies exist between the variables in the domain
--- can we learn these dependencies automatically and find the
best Bayesian network structure on our own, or at least a good network
structure?

In general, finding the optimal Bayesian network structure with which
to model a given dataset is NP-complete~\cite{chickering:npcomp}, even
when all the data is discrete and there are no missing values or
hidden variables.  A popular heuristic approach to finding networks
that model discrete data well is to hillclimb over network structures,
using a scoring function such as the BIC as the criterion to maximize.
Unfortunately, hillclimbing usually requires scoring a very large
number of networks.  While our algorithm for learning Gaussian
mixtures from data is comparatively fast for the complex task it
performs, it is still too expensive to re-run on the hundreds of
thousands of different variable subsets that would be necessary to
parameterize all the networks tested over an extensive hillclimbing
run.

However, there are other heuristic algorithms that often find networks
very close in quality to those found by hillclimbing but with much
less work.  A frequently used class of algorithms involves measuring
all pairwise interactions between the variables, and then greedily
constructing a network that models the strongest of these pairwise
interactions.  We use such an algorithm in this paper to automatically
learn structures of Bayesian networks.

In order to measure the pairwise interactions between the variables,
we start with an empty Bayesian network $B_{\epsilon}$ in which there
are no arcs, i.e., all variables are assumed to be independent.  We
use our mixture-learning algorithm to calculate the parameters in
this empty network, and then calculate this network's BIC score.  The
BIC score of a given Bayesian network $B$ is simply the log-likelihood
of the dataset $D$ given the network, minus a penalty term
proportional to the number of parameters in the network:

\[BIC(B) = \log P(D|B) - \frac{\log(R)}{2} |B|,\]

where $|B|$ is the number of parameters in the entire network $B$.
The number of parameters in the network is equal to the sum of
the parameters in each network node, where the parameters of
a node for variable $X_i$ are the parameters of $P_i(X_i, \Pi)$.

Once we have calculated the BIC score of the empty network
$B_{\epsilon}$, we calculate the BIC score of every possible Bayesian
network containing exactly one arc.  With $N$ variables, there are
${N \choose 2}$ or $O(N^2)$ such networks.  Let $B_{ij}$ denote the network with a
single arc from $X_i$ to $X_j$.  Note that to compute the BIC score of
$B_{ij}$, we need not recompute the mixture tables for any variable
other than $X_j$, since the others can simply be copied from $B_{\epsilon}$.
Now, define $I(X_i, X_j)$, the ``importance'' of the dependency between
variable $X_i$ and $X_j$, to be $BIC(B_{ij}) - BIC(B_{\epsilon})$.

After computing all the $I(X_i, X_j)$'s, we initialize our current
working network $B$ to the empty network $B_{\epsilon}$, and then
greedily add arcs to $B$ using the $I(X_i, X_j)$'s as hints for what
arcs to try adding next.  At any given point in the algorithm, the set
of variables is split into two mutually exclusive subsets, {\it DONE} and
{\it PENDING}.  All variables begin in the {\it PENDING} set.  Our algorithm
proceeds by selecting a variable in the {\it PENDING} set, adding arcs to
that variable from other variables in the {\it DONE} set, and then moving
the variable to the {\it DONE} set.  High-level pseudo-code for the 
algorithm appears in Figure~\ref{greedyalg}.

\begin{figure}
\begin{itemize}
\item $B := B_{\epsilon}$
\item While there are still variables in {\it PENDING}:
\begin{itemize}
\item Consider all pairs of variables $X_d$ and $X_p$ such that $X_d$
is in {\it DONE} and $X_p$ is in {\it PENDING}$^{\dagger}$.  Of these, let
$X_d^{max}$ and $X_p^{max}$ be the pair of variables that maximizes
$I(X_d, X_p)$.  Our algorithm selects $X_p^{max}$ as the next variable
to consider adding arcs to.
\item Let $X_d^1, X_d^2, \ldots X_d^K$ denote the $K$ variables in 
{\it DONE} with the highest values of $I(X_d, X_p^{max})$, in descending
order of $I(X_d, X_p^{max})$.
\item For $i = 1$ to $K$:
\begin{itemize}
\item If $X_p^{max}$ now has {\it MAXPARS} parents in $B$, or if 
$I(X_d^i, X_p^{max})$ is less than zero, break
out of the for loop over $i$ and do not consider adding any more
parents to $X_p^{max}$.
\item Let $B'$ be a network identical to $B$ except with an additional
arc from $X_d^i$ to $X_p^{max}$.  Call our mixture-learning algorithm
to update the parameters for $X_p^{max}$'s node in $B'$, and compute
$BIC(B')$.
\item If $BIC(B') > BIC(B)$, then $B := B'$.
\end{itemize}
\item Move $X_p^{max}$ from {\it PENDING} to {\it DONE}.
\end{itemize}
\end{itemize}
\caption{Our greedy network structure learning algorithm.}
\label{greedyalg}
\end{figure}
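The greedy loop in Figure~\ref{greedyalg} can be rendered compactly in Python. The mixture learning and BIC scoring are abstracted behind a hypothetical `score_with_arc` callable (in the usage example it simply returns the importance value), so this is a structural sketch rather than the full learner:

```python
def greedy_structure(variables, I, score_with_arc, maxpars, K):
    """Greedy arc addition following the pseudo-code in the figure.
    I[(d, p)] is the precomputed pairwise importance of arc d -> p;
    score_with_arc is a hypothetical callable returning the BIC gain
    of adding that arc given the current parent sets."""
    done, pending = [], list(variables)
    parents = {v: [] for v in variables}
    while pending:
        # Pick the PENDING variable with the strongest tie to DONE
        # (an arbitrary one on the first iteration, when DONE is empty).
        if done:
            _, p_max = max(((I[(d, p)], p) for d in done for p in pending),
                           key=lambda t: t[0])
        else:
            p_max = pending[0]
        # Try the K best DONE candidates as parents, in descending order.
        candidates = sorted(done, key=lambda d: I[(d, p_max)],
                            reverse=True)[:K]
        for d in candidates:
            if len(parents[p_max]) >= maxpars or I[(d, p_max)] < 0:
                break
            if score_with_arc(d, p_max, parents) > 0:
                parents[p_max].append(d)
        pending.remove(p_max)
        done.append(p_max)
    return parents

# Toy usage with symmetric, hand-picked importances.
variables = ["a", "b", "c"]
I = {("a", "b"): 5, ("b", "a"): 5, ("a", "c"): -1, ("c", "a"): -1,
     ("b", "c"): 2, ("c", "b"): 2}
parents = greedy_structure(variables, I,
                           score_with_arc=lambda d, p, par: I[(d, p)],
                           maxpars=2, K=6)
```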

The algorithm calls our mixture-learning algorithm $O(N^2)$ times on
mixtures of two variables in order to calculate all the pairwise
dependency strengths $I(X_i, X_j)$, and then $O(NK)$ more times on
mixtures of {\it MAXPARS}$+1$ or fewer variables.  

%Note that as the algorithm is described above, the step in the
%algorithm labeled with a ``$^{\dagger}$'' might appear to take
%$O(N^2)$ time, thus bumping the overall time of the algorithm up to
%$O(N^3)$.  By caching information between iterations, the cost of this
%step per iteration could be reduced to $O(N \log K)$, for a total cost
%of $O(N^2 \log K)$.  However, this savings is largely irrelevant; the
%real cost of the structure-learning algorithm lies in the $O(N^2)$
%calls to the mixture-learning learning algorithm.  Each of these calls
%typically takes at least $O(R)$ time, where $R$ is the number of
%records in the dataset, and $R$ is typically much larger than $N$.

If {\it MAXPARS} is set to 1 and $I(X_i, X_j)$ happens to be symmetric,
then our heuristic algorithm reduces to a maximum spanning tree
algorithm.  Out of all possible Bayesian networks in which each
variable has at most one parent, this maximum spanning tree is the
Bayesian network $B_{opt}^1$ that maximizes the scoring
function~\cite{chowliu:deptrees}.  If {\it MAXPARS} is set above 1, our
heuristic algorithm will always model a {\em superset} of the
dependencies in $B_{opt}^1$, and will always find a network with at
least as high a $BIC$ score as $B_{opt}^1$'s.

There are two small details that prevent our $I(X_i, X_j)$ from being
perfectly symmetric.  First, because the mixtures we use have
redundant parameters, the numbers of parameters in $B_{ij}$ and
$B_{ji}$ are not necessarily equal, and so the two networks' BIC
scores may be different even if the distributions they model are
identical.  Furthermore, because of the slight adjustments we make in
our models' parameters in order to handle sparse data (as described in
the experimental results section), $B_{ij}$ and $B_{ji}$ may actually
represent slightly different distributions.  However, in practice
$I(X_i, X_j)$ is close enough to symmetric that it's usually worth
blithely pretending that it is symmetric, particularly since this cuts
down the number of calls we need to make to our mixture-learning
algorithm in order to calculate the $I(X_i, X_j)$'s by roughly a
factor of 2.

Since learning joint distributions involving real variables is
expensive, calling our mixture-learning algorithm even just $O(N^2)$
times to measure all of the $I(X_i, X_j)$'s can take a prohibitive
amount of time.  We note that the $I(X_i, X_j)$'s are only used to
choose the order in which the algorithm selects variables to move from
{\it PENDING} to {\it DONE}, and to select which arcs to try adding to the
graph.  The actual values of $I(X_i, X_j)$ are irrelevant --- the only
things that matter are their ranks, and whether they are greater than
zero.  Thus, in order to reduce the expense of computing the $I(X_i,
X_j)$'s, we can try computing them on a {\em discretized} version of
the dataset rather than the original dataset including continuous
values.  The resulting ranks of $I(X_i, X_j)$ will not generally be
the same as they would if they were computed from the original
dataset, but we would expect them to be highly correlated in 
many practical circumstances.

\section{Experiments}
\label{experimentsect}

In this section, we compare the performance of the network-learning
algorithm described above to that of four other algorithms.  Each of
the four other algorithms is designed to be similar to our
network-learning algorithm in all respects but one.
First we describe a few details about how our primary network-learning
algorithm is used in our experiments, and then we describe the four
alternative algorithms.

\subsection{Algorithms}

\subsubsection{Mix-net Learner}

This is our primary network-learning algorithm.  For our experiments
on both datasets, we set {\it MAXPARS} to 3 and $K$ to 6.  When generating
any given Gaussian mixture, we give our accelerated EM algorithm
thirty seconds to find the best mixture it can.  In order to make
the most of these thirty-second intervals, we also limit our overall
training algorithm to using a sample of at most 10,000 datapoints from
the training set.  Rather than computing the $I(X_i, X_j)$'s with the
original dataset, we compute them from a version of the dataset in which 
all continuous variables have been discretized to 16 different values.
The boundaries of the 16 bins for each variable's discretization are
chosen so that the number of datapoints in each bin is approximately
equal.
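The equal-frequency discretization can be sketched as follows (the bin count is a parameter; the text uses 16):

```python
def equal_frequency_bins(values, n_bins=16):
    """Discretize a continuous column into n_bins bins holding roughly
    equal numbers of datapoints, as used when precomputing the
    I(X_i, X_j)'s: sort the values and assign bin labels by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

labels = equal_frequency_bins([10.0, 3.0, 7.0, 1.0], n_bins=2)
```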

Mixture tables containing many discrete variables (or a few discrete variables
each of which can take on many values) can severely overfit data, since
some combinations of the discrete variables may occur rarely in the data.
For now, we attempt to address this problem as follows:

\begin{itemize}
\item The estimates for the distribution $P(\vec{Q})$ over the discrete
variables in any given mixture are smoothed by adding half a datapoint's
worth of probability mass to each possible combination and renormalizing accordingly.
\item In addition to the Gaussian components, each mixture over 
continuous variables contains a uniform component.  This uniform
component represents an even distribution over a hyperbox that bounds 
the entire dataset.  We fix this uniform component's total probability
mass at half a datapoint's worth, and renormalize the distribution accordingly.
If there are too few datapoints in the mixture to fit even a single Gaussian,
then the mixture contains only this uniform component.

\end{itemize}

Whenever Gaussian mixtures are learned, there is a possibility that a
Gaussian will become ill-conditioned and further mathematical
operations will fail due to roundoff error.  Even worse, a Gaussian
may shrink to an arbitrarily small size around a single datapoint and
thus contribute an arbitrarily large amount to the log-likelihood of
the dataset.  We help prevent these conditions from occurring by adding
a small constant to the diagonal elements of all Gaussians'
covariance matrices.
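A sketch of this regularization; the value of `eps` is an arbitrary illustrative choice, not the constant used in the experiments:

```python
import numpy as np

def regularize_covariance(cov, eps=1e-6):
    """Add a small constant to the diagonal elements of a covariance
    matrix, preventing a Gaussian from collapsing onto a single
    datapoint and becoming ill-conditioned."""
    return cov + eps * np.eye(cov.shape[0])

# Even a degenerate (all-zero) covariance becomes invertible.
c = regularize_covariance(np.zeros((2, 2)), eps=0.01)
```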

\subsubsection{Independent Mixtures}

This algorithm will help us illustrate how much leverage our primary
network-learning algorithm gets by modelling any dependencies between
variables at all.  It is identical to our primary network-learning
algorithm in almost all respects; the main difference is that here the
{\it MAXPARS} parameter has been set to zero, thus forcing all variables to be modelled independently.  We also give this algorithm
more time to learn each individual Gaussian mixture, so that it is
given a total amount of computational time at least as great as that
used by our primary network-learning algorithm.

\subsubsection{Trees}

This algorithm will help us illustrate how much leverage our network-learning
algorithm gets by generating models more complex than the optimal
tree-shaped network.  It is identical to our primary network-learning
algorithm in all respects except that the {\it MAXPARS} parameter has
been set to one, and we give it more time to learn each individual 
Gaussian mixture (as we did for the Independent Mixtures algorithm).

\subsubsection{Single-Gaussian Mixtures}

This algorithm will help us illustrate how much leverage our primary
learning algorithm gets by using mixtures containing multiple
Gaussians.  It is identical to our primary network-learning algorithm
except for the following differences.  When learning a given Gaussian
mixture $P_i(\vec{C_i} | \vec{Q_i})$, we use a single
multidimensional Gaussian rather than a mixture.  Note, however, that
some of the marginal distributions $P_i(\vec{C_{\Pi_i}} |
\vec{Q_{\Pi_i}})$ may contain multiple Gaussians when the variable
marginalized away is discrete.  Since single Gaussians are much
easier to learn in high-dimensional spaces than mixtures are, we allow
this single-Gaussian algorithm much more freedom in creating large
mixtures: we set both {\it MAXPARS} and $K$ to 30.  We also allow it to
use all datapoints in the training set rather than restrict it to a
sample of 10,000.

\subsubsection{Pseudo-Discrete Bayesian Networks}

This algorithm is similar to our primary network-learning algorithm in
that it uses the same sort of greedy algorithm to select which arcs to
try adding to the network.  However, the networks this algorithm
produces do not employ Gaussian mixtures.  Instead, the distributions
it uses are closely related to the distributions that would be
modelled by a Bayesian network for a completely discretized version
of the dataset.  For each continuous variable $X_i$ in the domain, we
break $X_i$'s range into $F$ buckets.  The boundaries of the buckets
are chosen so that the number of records in the dataset assigning
values to $X_i$ that lie within each bucket is approximately equal.
The conditional distribution for $X_i$ is modelled with a table containing
one entry for every combination of its parent variables,
where each continuous parent variable $X_p$'s value is discretized
according to the $F$ buckets we have selected for that parent
variable.  Each entry in the table contains a conditional histogram
for $X_i$ recording the conditional probability that $X_i$'s
value lies within the boundaries of each of $X_i$'s $F$ buckets.  We
then translate the conditional probability associated with each bucket
into a conditional probability density spread uniformly throughout the
range of that bucket.  (Discrete variables are handled in a similar
manner, except the translation from conditional probabilities
to conditional probability densities is not performed.)
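The equal-frequency bucketing and piecewise-uniform density
translation described above can be sketched as follows (illustrative
code; the function names and tie-breaking details are our own
assumptions, not the exact implementation):

```python
import bisect

def equal_frequency_bounds(values, F):
    # Interior boundaries at roughly equally spaced order statistics,
    # so each bucket holds about the same number of training values.
    v = sorted(values)
    n = len(v)
    return [v[(n * k) // F] for k in range(1, F)]

def bucket_density(values, F):
    # Histogram mass per bucket, spread uniformly over the bucket's
    # range to give a piecewise-constant density estimate.
    bounds = equal_frequency_bounds(values, F)
    edges = [min(values)] + bounds + [max(values)]
    counts = [0] * F
    for x in values:
        counts[min(bisect.bisect_right(bounds, x), F - 1)] += 1
    n = len(values)
    return [(edges[k], edges[k + 1],
             (counts[k] / n) / max(edges[k + 1] - edges[k], 1e-12))
            for k in range(F)]
```

Each returned triple is (bucket low edge, bucket high edge, constant
density on that bucket); the per-bucket masses sum to one.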

When performing experiments with this algorithm, we re-run it for
several different choices of $F$: 2, 4, 8, 16, and 32.  Of the
resulting networks, we pick the one that maximizes the BIC.  When the
algorithm uses a particular value for $F$, the variable interactions
$I(X_i, X_j)$ are computed using a version of the dataset that has
been discretized accordingly, and then arcs are added in a greedy
fashion as before.  The networks used by this algorithm 
do not have redundant parameters as our mix-nets do, as each node
contains only a model of its variable's conditional distribution
given its parents rather than a joint distribution.

Much research has been performed on better ways of
discretizing real variables in Bayesian networks
(e.g. \cite{monticoop:mdmlbn}, \cite{kozkol:nddohn}); the
discretization algorithm discussed here and currently implemented for
our experiments is certainly not state-of-the-art.

\subsection{Datasets}

We now describe the datasets used in our experiments.  Two minor
adjustments are made to each of the original datasets before handing
them to any of our learning algorithms.  First, all continuous variables are
scaled so that all values lie within $[0, 1]$.  This helps put the
log-likelihoods we report in context, and possibly helps prevent
problems with limited machine floating-point representation.
Second, each continuous value in the dataset is randomly perturbed by
adding to it a value selected uniformly at random from $[-0.0005, 0.0005]$.
This noise is added to eliminate any deterministic relationships or delta
functions in the data.  The log-likelihood of a continuous dataset
exhibiting even a single deterministic relationship between two variables
is infinite when given the correct model; in such a situation,
it is not clear how meaningful log-likelihood comparisons between
competing learning algorithms would be.  We add uniform noise rather than
Gaussian noise in order to prevent the introduction of a bias that
favors Gaussian mixtures.
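A minimal sketch of these two preprocessing adjustments, assuming one
continuous column is processed at a time (the function name and the
fixed seed are ours):

```python
import random

def preprocess(column, noise=0.0005, rng=None):
    # Scale to [0, 1], then jitter with Uniform(-noise, +noise) to
    # break any deterministic relationships or delta functions.
    rng = rng or random.Random(0)
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0
    return [(x - lo) / span + rng.uniform(-noise, noise) for x in column]
```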

We tested the algorithms previously described on two different
datasets taken from real scientific experiments.  For each dataset
and each algorithm, we performed ten-fold cross-validation, and
recorded the log-likelihoods of the test sets given the resulting models.

The ``Bio'' dataset contains data from a high-throughput cell
assay.  There are 12,671 records and 31 variables.  26 of the variables
are continuous; the other five are discrete.  Each discrete variable
can take on either two or three different possible values.

The ``Astro'' dataset contains data taken from the Sloan Sky Survey, an
extensive astronomical survey currently in progress.
This dataset contains 111,456 records and 68 variables.  65 of the variables
are continuous; the other three are discrete, with arities ranging from three
to 81.  

Figure~\ref{resulttable} shows the mean log-likelihoods of the test
sets according to models generated by our five network-learning
algorithms, as well as the standard deviation of the means.  (Note
that the log-likelihoods are positive since most of the variables are
continuous and bounded within $[0, 1]$.)

\begin{figure*}
\begin{center}
\begin{tabular}{|c||c|c|c|c|c|} \hline
& {\bf Mix-Net} & Ind. Mix. & Tree & Sing. Gauss. & Pseudo-Disc. \\ \hline \hline
Bio & {\bf 82400 +/- 400} & 33200 +/- 500 & 74400 +/- 400 & 58700 +/- 100 & 59100 +/- 100 \\ \hline
Astro & {\bf 3329000 +/- 6000} & 2750000 +/- 12000 & 3287000 +/- 4000 & 2163000 +/- 3000 & N/A \\ \hline
\end{tabular}
\caption{Average log-likelihoods of test sets in 10-way cross-validation}
\label{resulttable}
\end{center}
\end{figure*}

On the Bio dataset, our mix-net learner achieved significantly
higher log-likelihood scores than the other four model learners.  The
fact that it significantly outperformed the tree-learner indicates
that it is effectively utilizing relationships between variables more
complicated than mere pairwise dependencies.  The fact that its
networks outperformed the pseudo-discrete networks and the
single-Gaussian networks indicates that the Gaussian mixture models
used for the network nodes' parameterizations helped the network
achieve much better prediction than is possible with much simpler
parameterizations.  Our primary mix-net learning algorithm took about an hour
and a half of CPU time to generate its model for each of the ten
cross-validation runs for this dataset.

As an extra test of our algorithm's robustness, we constructed an
additional synthetic dataset from the Bio dataset as follows.  All
real values in the dataset were discretized in a manner identical to
the manner in which the pseudo-discrete networks discretized them,
with 16 bins per variable.  (Out of the many different numbers of bins
we tried with the pseudo-discrete networks, 16 was the number that
worked best on the Bio dataset.)  Each discretized value was then
translated back into a real value by sampling it uniformly from the
corresponding range of original real values.  The resulting synthetic
dataset is similar in many respects to the original dataset, but its
probability densities are now composed of piecewise constant
hypervolumes --- precisely the kind of distributions that the
pseudo-discrete networks model.  This synthetic dataset causes the
pseudo-discrete network learning algorithm to produce a network
identical to the network it learns from the original dataset; the
pseudo-discrete network's test-set log-likelihood performance on this
synthetic dataset is also identical to its test-set log-likelihood
performance on the original data.  However, we might expect our
mix-nets to perform much worse than the pseudo-discrete networks on
this synthetic dataset, since the synthetic dataset's distributions
may be much harder to fit with mixtures of Gaussians.  As it turns
out, the test-set performance of our mix-nets on this synthetic
dataset {\em is} worse than the performance of pseudo-discrete
networks, but not dramatically so: the mix-net's average score on the
synthetic dataset drops to 58300 +/- 200.  This is significantly worse
than the pseudo-discrete networks' score, which stays at 59100 +/-
100, but the difference in score between the two types of networks is
not nearly as large as the difference on the original dataset, where
the mix-nets clearly dominated.

[{\bf AUTHORS' NOTE}: the results for the Astro dataset presented here
are preliminary.  Not all of the cross-validation experiments have yet
been completed, and we still need to check the results more closely.
The final version of this paper will include the full results.]  Our
primary algorithm similarly outperformed the other algorithms on the Astro
dataset (although we do not yet have results for the Pseudo-Discrete
networks on this domain).  Our primary learning algorithm took about 
three hours of CPU time to generate its model for each of the cross-validation
runs for this dataset.

\section{Possible Applications for Mix-Nets}
\label{appsect}

\subsection{Anomaly Detection}

One obvious application for accurate joint probability
models over large mixes of discrete and continuous variables is
anomaly detection.  The models can be used online to help detect the
presence of abnormally low-probability situations.  Alternatively,
they can be used offline on the same datasets from which they are
learned in order to rank the datapoints by their log-likelihood.  If
the learned model is accurate, the type of datapoints assigned low
log-likelihoods are probably unusual in reality as well.  For example,
we hope to use networks learned from Sloan Sky Survey data to help 
select unusual astronomical objects for further inspection by human 
investigators.

\subsection{Data Compression}

Many popular and powerful methods for data compression, such as
arithmetic coding (see, e.g.,~\cite{wnc:arith}), rely on explicit probabilistic
models of the data they are compressing.  
%When compressing discrete
%data with such methods, the length of the resulting compressed dataset
%in bits will be the number of bits required to encode the model, plus
%a number of bits proportional to the negative log-likelihood of that
%dataset given the encoded model.
Recent research
(\cite{frey:gmfmladc}, \cite{davies:bnfldc}) has shown that using
automatically learned Bayesian networks for these models can
result in compression ratios dramatically better than those achieved
by {\tt gzip} or {\tt bzip2}, while maintaining megabyte per second
decoding speeds~\cite{davies:bnfldc}.  Can this approach be extended to
real-valued data?

In order to compress real data, some loss of accuracy must usually be
accepted --- after the first few significant figures, real values
typically become impossible to model as anything other than
incompressible random noise.  Thus, the question is: how much can the
data be compressed if we are willing to accept some given average loss of
accuracy in the reconstruction?  Lossily compressing values using a
Gaussian model is a well-studied problem (see,
e.g. \cite{sayood:itdc}).  How do we lossily compress values coming
from a mixture of Gaussians?  One obvious approach would be to encode each
point as follows.  First, we calculate the likelihood with which it
came from each Gaussian in the mixture.  Suppose the maximum
likelihood Gaussian is $G_m$.  We then encode in our compressed
dataset the fact that the next datapoint is generated by $G_m$, and
then encode the datapoint using $G_m$ as our model distribution.
Unfortunately, this method of coding is suboptimal when the Gaussians
overlap.  However, the bits wasted by this suboptimality can be
recovered by using a clever ``bits-back'' method that encodes side
information by {\em not} always choosing the maximum-likelihood
Gaussian~\cite{frey:gmfmladc}.
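The component-selection step of this simple coder can be illustrated
in one dimension (a sketch under our own naming; the paper's mixtures
are multidimensional, and a real coder would follow this choice with
arithmetic coding under the chosen Gaussian):

```python
import math

def component_loglikes(x, mixture):
    # Log-likelihood of scalar x under each (weight, mean, variance)
    # component of a one-dimensional Gaussian mixture.
    return [math.log(w) - 0.5 * (math.log(2 * math.pi * var)
                                 + (x - mu) ** 2 / var)
            for w, mu, var in mixture]

def encode_choice(x, mixture):
    # Index of the component a naive coder would record before
    # encoding x under that component; a bits-back coder would
    # instead sample this index to recover side information.
    ll = component_loglikes(x, mixture)
    return max(range(len(ll)), key=ll.__getitem__)
```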

Using automatically learned Mix-nets may be a very effective way of
compressing large datasets with both continuous and discrete values, where
the continuous values are handled by lossy compression employing bits-back
coding.

\subsection{Classification}

So far, we have only discussed learning mix-nets in situations where
our objective is to find a network that accurately models the
distribution over the entire set of variables.  What do we do if our
goal is to accurately predict the distribution of only one discrete
variable given the values of all the other variables in the domain?
An arbitrary network learned by an algorithm optimized to accurately
model the distribution over all the variables is not likely to fare
well compared to networks learned by algorithms that take the specific
prediction task at hand into consideration.

A simple, popular and effective type of classifier, the Naive Bayes
classifier, assumes that given the target variable (that is, the
variable we are trying to predict), the distributions of all other
variables are independent.  This corresponds to using a Bayesian
network in which there is an arc from the target variable to each
other variable, but no arcs between the other variables.  The
non-target variables are usually assumed to be discrete; however,
continuous variables have been handled in the past by using Gaussians
or kernel density estimators for the conditional distributions of
continuous variables (e.g., \cite{johnlang:ecdibc}).

A recently developed type of classifier, Tree Augmented Naive Bayes
(TAN)~\cite{nfriedman:bnc}, augments the network
structure of Naive Bayes with additional arcs between the non-target
variables, where each non-target variable is conditioned on at most
one other non-target variable.  The classifier has been extended to
handle continuous variables by representing each continuous variable
in the network twice: once in a discretized form, and once in a simple 
conditional parametric form~\cite{nfriedman:dapf}.  

Our network-learning algorithm can easily be used to learn classifiers
similar in structure to TAN classifiers.  Before, we measured the
importance of the interactions $I(X_i, X_j)$ between a pair of
variables $X_i$ and $X_j$ by calculating the increase in the BIC score
achieved by starting from the empty network $B_{\epsilon}$ and adding
an arc from $X_i$ to $X_j$.  In order to learn a TAN-like classifier,
we simply replace $B_{\epsilon}$ with $B_{NB}$, the network
structure employed by Naive Bayes.  The pairwise interactions $I(X_i,
X_j)$'s between each pair of non-target variables $X_i$ and $X_j$ are
then computed by determining the increase in BIC score achieved by
adding a single arc to $B_{NB}$ that goes from $X_i$ to $X_j$.  After
we have computed $I$, we initialize our current working network to
$B_{NB}$; then we put the target variable in {\it DONE}, and all the
non-target variables in {\it PENDING}.  We set {\it MAXPARS} to two --- the
target variable (which we've already added as a parent) plus one more
possible parent --- and run our greedy network-learning algorithm just
as before.  This procedure gives us a TAN-like classifier, except that
complex interactions between continuous variables can be modelled, as
well as interactions between continuous and discrete variables,
without requiring any discretization of the continuous variables.  
On the other hand, if we set {\it MAXPARS} above two, the resulting network may
capture even more dependencies between non-target variables and it will
achieve a BIC score at least as high as that achieved by the TAN-like
classifier, although this may not necessarily translate into better
classification accuracy.
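A heavily simplified skeleton of this TAN-like search, assuming the
pairwise score increases $I(X_i, X_j)$ have already been computed
(all names, the tie-breaking, and the bookkeeping of placed versus
pending variables here are our own simplifications, not the paper's
exact procedure):

```python
def learn_tan_like(variables, target, interaction, max_pars=2):
    # Start from the Naive Bayes structure: an arc from the target
    # to every other variable.
    done = {target}
    pending = set(variables) - {target}
    parents = {v: [target] for v in pending}
    parents[target] = []
    while pending:
        # Best-scoring (parent, child) pair whose parent is already placed.
        best = max(((interaction.get((p, c), 0.0), p, c)
                    for c in sorted(pending)
                    for p in sorted(done - {target})),
                   default=(0.0, None, None), key=lambda t: t[0])
        score, p, c = best
        if c is None:
            c = max(pending)  # first pass: no candidate parents yet
        elif score > 0 and len(parents[c]) < max_pars:
            parents[c].append(p)  # at most one extra non-target parent
        done.add(c)
        pending.discard(c)
    return parents
```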

In addition to classification tasks in which the target variable is
discrete, the above network-learning algorithm can also be used 
to learn networks for predicting a probability density function of
a continuous variable given the value of all other variables in
the domain.

\subsection{Inference}
While it is possible to perform exact inference in some kinds of
networks modelling continuous values (e.g. \cite{alg:iumpattivgcn}, 
\cite{drivmorr:iocbnusowg}), 
exact inference in arbitrarily-structured mix-nets with continuous
variables is probably not possible.  However, inference in these
networks can be performed via stochastic sampling methods.  Given a
mixture table representing the joint probability of a variable and its
parents in the Bayesian network, it is possible to sample the
distribution of the child variable given the values of its parent
variables.  Thus, given a mix-net, we can easily employ likelihood
weighting to generate a set of weighted datapoints representing
a sample from any conditional distribution we wish.  
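Likelihood weighting itself is generic: a mix-net only needs to
supply, for each node, a sampler for the child given its parents
(which the mixture table provides) and the corresponding conditional
density.  A hedged sketch with our own interface:

```python
import random

def likelihood_weighting(nodes, evidence, n_samples, rng=None):
    # `nodes` is topologically ordered: (name, parents, sample_fn, pdf_fn).
    # sample_fn(parent_values, rng) draws the child given its parents;
    # pdf_fn(value, parent_values) evaluates the conditional density.
    rng = rng or random.Random(0)
    samples = []
    for _ in range(n_samples):
        assignment, weight = {}, 1.0
        for name, parents, sample_fn, pdf_fn in nodes:
            pvals = [assignment[p] for p in parents]
            if name in evidence:
                # Clamp evidence nodes and weight by their likelihood.
                assignment[name] = evidence[name]
                weight *= pdf_fn(evidence[name], pvals)
            else:
                assignment[name] = sample_fn(pvals, rng)
        samples.append((assignment, weight))
    return samples
```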

\section{Conclusions}
\label{conclsect}

We have described a practical method for learning Bayesian networks
capable of modelling complex interactions between multiple continuous
and discrete variables, and have provided experimental results showing
that the method is both feasible and effective on scientific data with
dozens of variables.  The networks learned by this algorithm and related
algorithms show considerable potential for many important applications.
However, there are many ways in which our method can be improved upon.
We now briefly discuss a few of the more obvious possibilities for
improvement.

\subsection{Variable Grouping}

The mixture tables in our network include a certain degree of
redundancy, since the mixture table for each variable models the joint
probability of that variable with its parents rather than just the
conditional probability of that variable given its parents.  For
example, consider a completely connected network containing $N$
continuous variables in which the joint probability of each variable
and its parents is modelled as a single multidimensional Gaussian.  In
this case, our network will have $O(N^3)$ parameters, despite the fact
that the overall distribution modelled by the network is actually just
a single multidimensional Gaussian representable with $O(N^2)$
parameters.  This wastes memory and computational time.  Perhaps more
importantly, the larger number of parameters may cause a network-learning
algorithm to favor a simpler model with fewer parameters, even if there
is enough data to justify the $O(N^2)$ parameters that would be used
by a single multidimensional Gaussian.
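A quick back-of-envelope check of these counts (the helper names are
ours): one $d$-dimensional Gaussian has $d$ means plus $d(d+1)/2$
covariance entries, and a fully connected single-Gaussian mix-net
stores one joint over node $i$ and its $i$ parents:

```python
def gaussian_params(d):
    # One d-dimensional Gaussian: d means + d(d+1)/2 covariance entries.
    return d + d * (d + 1) // 2

def mixnet_params(n):
    # Fully connected single-Gaussian mix-net: node i stores a joint
    # Gaussian over itself and its i parents, so the total grows as
    # O(n^3) versus the O(n^2) of one joint Gaussian over everything.
    return sum(gaussian_params(i + 1) for i in range(n))
```

For $N = 20$ this comes to 1750 stored parameters versus 230, and the
ratio grows linearly with $N$.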

One possible approach for reducing the number of redundant parameters
in the network is to allow each node in the network to represent
a {\em group} of variables.  Each variable must be represented in exactly
one group.  A group $G_i$ representing multiple
variables $\vec{X_i}$ simply contains a mixture table modelling
the joint distribution $P_i(\vec{X_i}, \Pi_i)$, where $\Pi_i$ is the set of $G_i$'s 
parent variables.  This mixture table can be marginalized to obtain
$P_i(\Pi_i)$ (and thus $P_i(\vec{X_i} | \Pi_i)$) just as we did in the case 
where each node represented only one variable.  

\begin{figure}
\epsfxsize=2in
\centerline{
\epsffile{huffgraph.eps}}
\caption{An example mix-net in which six variables are represented by a graph with three groups.}
\label{huffgraph}
\end{figure}

Suppose $X_p$ is a parent variable of node $N_i$, and that $X_p$ is 
represented in some other node $N_j$ that also includes some other variable
$X_q$.  It is interesting to note that it is {\em not} necessary to include
$X_q$ in the distribution modelled at node $N_i$.  For example, 
Figure~\ref{huffgraph} shows a mix-net with three nodes and
six variables.  Group 3 contains a mixture table modelling 
$P(X_2, X_3, X_4, X_5, X_6)$, but $X_1$ is not included in this model
even though the other two variables in Group 1, $X_4$ and $X_5$, are.  This example network
decomposes the probability distribution over $X_1, \ldots, X_6$ as
follows:

{\small \[P(X_1, \ldots, X_6) = P_1(X_1, X_4, X_5) P_2(X_3) 
\frac{P_3(X_2, X_3, X_4, X_5, X_6)}{P_3(X_3, X_4, X_5)} \]}

where $P_1$, $P_2$, and $P_3$ are computed from three different
mixture tables that are not necessarily consistent with each other.

Grouping variables together in such a manner allows us to reduce the
number of parameters in the network.  For example, a single
multidimensional Gaussian over $N$ variables can now be represented
with $O(N^2)$ parameters by simply placing them all into a single group.
An interesting line of research would be to design mix-net learning algorithms
that automatically group variables together in order to reduce the
total number of parameters.

\subsection{Better Structure-Learners}

So far we have only developed and experimented with variations of one
particular network structure-learning algorithm.  There is a wide
variety of structure-learning algorithms for typical Bayesian
networks, many of which could be employed when learning mix-nets.  The
quicker and dirtier of these algorithms might be applicable directly
to learning mix-net structures.  The more time-consuming algorithms
such as hillclimbing can be used to learn Bayesian networks on
discretized versions of the datasets; the resulting networks may then
be used as hints for which sets of dependencies might be worth trying
in a mix-net.

\subsection{Better Parameter-Learners}

While the accelerated EM algorithm we use to learn Gaussian mixtures
is very fast for low-dimensional mixtures and comes up with fairly
accurate models, its speed decreases dramatically as the number of
variables in the mixture increases.  This is the primary reason we
have not yet attempted to learn mixture networks with more than four
variables per mixture.  Further research is currently being conducted on
alternative data structures and algorithms with which to accelerate
EM, in the hope that they will scale more gracefully to higher dimensions
(REF?).  

Similarly, our current method of handling discrete variables does not
deal very well with discrete variables that can take on many possible
values, or with combinations involving many discrete variables.  
Better methods of dealing with these situations are also grounds
for further research.

%\subsubsection*{References} 

\bibliography{scottd}
\bibliographystyle{plain}
 
\end{document} 



IMPORTANT REFERENCES: 

Driver & Morrell, UAI 95 pp. 134-140
Networks in which prior and conditional distributions are mixtures of
Gaussians; inference in singly connected networks

Heckerman and Geiger, UAI 95, p. 274
``Learning Bayesian Networks: A Unification of Discrete and Gaussian Domains''

John and Langley, p. 338, UAI 95
``Estimating Continuous Distributions in Bayesian Classifiers''
Each conditional probability is either a single gaussian or a kernel estimator.
One single discrete parent class is only source of arcs.

Satnam Alag, UAI 96, p. 20
``Inference Using Message Propagation and Topology Transformation in Vector
Gaussian Continuous Networks''
Nodes represent multivariate Gaussian distributions.  Inference rules,
etc. 

Kozlov & Koller, UAI 97, p. 314
``Nonuniform dynamic discretization of hybrid networks''
Tree-based discretization of continuous spaces; gives rules for
doing inference with them

Monti & Cooper, UAI 98
``A Multivariate Discretization Method for Learning Bayesian Networks 
from Mixed Data''

Monti & Cooper, UAI 99
A Bayesian network classifier that combines a finite-mixture model and a 
naive-Bayes model.

Friedman, Geiger, and Goldszmidt, MLJ 97
Bayesian Network Classifiers

Friedman & Goldszmidt, ICML 98
Bayesian network classification with continuous attributes:
Getting the best of both discretization and parametric fitting
Extends ``Bayesian network classifiers'' so that each feature is
modelled both by a single Gaussian and a discretized version

