I was involved in some of the earlier work toward learning stochastic
circuit models (a kind of temporal graphical model) of gene regulation
from microarray expression data. The method was a
statistical-physics-based (deterministic annealing) clustering model and
algorithm in which a datum could truly belong to multiple clusters
simultaneously. (Mjolsness, Castano, and Gray, Multi-Parent
Clustering Algorithms for Large-Scale Gene Expression, JPL Report
1999. Mjolsness et al., Clustering Methods for the Analysis of
C. elegans Gene Expression Array Data, PSB 1999.)
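
To give a feel for the deterministic annealing idea, here is a minimal sketch of the standard textbook form of annealed soft clustering. It is not the multi-parent model from the reports cited above: in this simplified version each datum's memberships sum to one, whereas the multi-parent model lets a datum belong fully to several clusters at once. The function names, the cooling schedule, and all default parameter values are assumptions chosen for illustration.

```python
import numpy as np

def soft_memberships(X, centers, temperature):
    """Gibbs-style soft assignments: m[i, k] is proportional to exp(-||x_i - c_k||^2 / T)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    m = np.exp(logits)
    return m / m.sum(axis=1, keepdims=True)

def anneal_cluster(X, k, t_start=10.0, t_end=0.01, cooling=0.9, inner_iters=20, seed=0):
    """Hypothetical helper: alternate soft assignment and center updates while lowering T."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # All centers start near the data mean; a tiny perturbation lets them split as T drops.
    centers = X.mean(axis=0) + 1e-3 * rng.standard_normal((k, X.shape[1]))
    t = t_start
    while t > t_end:
        for _ in range(inner_iters):
            m = soft_memberships(X, centers, t)
            weights = np.maximum(m.sum(axis=0), 1e-12)  # guard against empty clusters
            centers = (m.T @ X) / weights[:, None]
        t *= cooling
    return centers, soft_memberships(X, centers, t_end)
```

At high temperature every datum is shared almost equally among the clusters; as the temperature falls the assignments harden, which is what tends to make the procedure less sensitive to initialization than plain k-means.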


Computational Chemistry, Drug Discovery, Biology
 
Along with many other people today, I believe that applied
mathematical approaches (computational, statistical, engineering) to
chemical and biological problems represent one of the most fruitful
scientific goldmines for the next few decades at least. I would say,
however, that this is really a special case of a more general
opportunity: all of the natural sciences should and will be transformed
by such approaches in the near future. We have a large collaboration with a
major pharmaceutical company.

Molecule Ranking for Virtual Screening
 
Automated high-throughput drug screening constitutes a critical
emerging approach in modern pharmaceutical research. The statistical
task of interest is that of discriminating active versus inactive
molecules given a target molecule, in order to rank potential drug
candidates for further testing. Because the core problem is one of
ranking, our approach concentrates on accurate estimation of unknown
class probabilities, in contrast to popular non-probabilistic methods
which simply estimate decision boundaries. While this motivates
nonparametric density estimation, we are faced with the fact that the
molecular descriptors used in practice typically contain thousands of
binary features. In this paper we attempt to improve the extent to
which kernel density estimation can work well in high-dimensional
classification settings. We present a synthesis of techniques
(SLAMDUNK: Sphere, Learn A Metric, Discriminate Using Nonisotropic
Kernels) which yields favorable performance in comparison to previously
published approaches to drug screening such as support vector
machines, as tested on a large high-dimensional proprietary
pharmaceutical dataset. (Gray, Komarek, Liu, and Moore,
High-Dimensional Probabilistic Classification for Drug Discovery,
Computational Statistics 2004.) I consider this to be just a first
dip into the sea of this problem. Whether we use this method as a
component of the eventual system depends on the next step.
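
The ranking idea can be sketched with a plain kernel density classifier: estimate one density for the actives and one for the inactives, combine them with the class priors through Bayes' rule, and sort candidates by the resulting probability of activity. This is only an illustrative baseline under my own assumptions, not SLAMDUNK itself: the sphering and metric-learning steps are omitted, a single isotropic Gaussian bandwidth is used for both classes, and the function names and default values are hypothetical.

```python
import numpy as np

def log_kde(queries, train, bandwidth):
    """Log of the average Gaussian kernel value between each query and the training set.

    The kernel normalizing constant is dropped; it cancels in the class posterior
    because both classes share the same bandwidth in this sketch.
    """
    queries = np.asarray(queries, dtype=float)
    train = np.asarray(train, dtype=float)
    # For binary descriptors the squared Euclidean distance equals the Hamming distance.
    d2 = ((queries[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    a = -d2 / (2.0 * bandwidth ** 2)
    a_max = a.max(axis=1, keepdims=True)  # log-sum-exp trick for stability
    return a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1)) - np.log(len(train))

def active_probability(queries, actives, inactives, bandwidth=1.0):
    """Estimate P(active | x) for each query via Bayes' rule on the two class densities."""
    n1, n0 = len(actives), len(inactives)
    log_p1 = np.log(n1 / (n1 + n0)) + log_kde(queries, actives, bandwidth)
    log_p0 = np.log(n0 / (n1 + n0)) + log_kde(queries, inactives, bandwidth)
    # 1 / (1 + exp(log_p0 - log_p1)), computed without overflow
    return np.exp(-np.logaddexp(0.0, log_p0 - log_p1))

def rank_candidates(queries, actives, inactives, bandwidth=1.0):
    """Return candidate indices sorted from most to least likely active."""
    p = active_probability(queries, actives, inactives, bandwidth)
    return np.argsort(-p)
```

Working in log space matters here because with thousands of features the raw kernel values underflow easily. The pairwise-distance array in this sketch is memory-hungry and would need to be computed in batches for a dataset of realistic size.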
Next step:
The ultimate goal is to automatically choose which molecules
to test, in a running system. I am working right now on a new formulation
of active learning which is not based on the traditional least-squares
approach to experimental design. Another huge subproblem in this
enterprise is the effective representation of molecules.

