David Cohn (2003).
Low rank approximation techniques are widespread in pattern
recognition research --- they include Latent Semantic Analysis (LSA),
Probabilistic LSA, Principal Components Analysis (PCA), the
Generative Aspect Model, and many forms of bibliometric analysis. All
make use of a low-dimensional manifold onto which data are
projected.

Such techniques are generally "unsupervised," which
allows them to model data in the absence of labels or categories.
With many practical problems, however, some prior knowledge is
available in the form of context. In this paper, I describe a
principled approach to incorporating such information, and demonstrate
its application to PCA-based approximations of several data sets.

David Cohn and Thomas Hofmann (2001).
We describe a joint probabilistic model for modeling the contents
and inter-connectivity of document collections such as sets
of web pages or research paper archives. The model is based on
a probabilistic factor decomposition and allows identifying principal
topics of the collection as well as authoritative documents within
those topics. Furthermore, the relationships between topics are
mapped out in order to build a predictive model of link content. Among the
many applications of this approach are information retrieval and
search, topic identification, query disambiguation, focused web
crawling, web authoring, and bibliometric analysis.

Adam
Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal
(2000).
This paper investigates whether a machine can automatically learn the
task of finding, within a large collection of candidate responses, the
answers to questions. The learning process consists of inspecting a
collection of answered questions and characterizing the relation
between question and answer with a statistical model. For the purpose
of learning this relation, we propose two sources of data: Usenet FAQ
documents and customer service call-center dialogues from a large
retail company. We will show that the task of "answer-finding" differs
from both document retrieval and traditional question-answering,
presenting challenges different from those found in these
problems. The central aim of this work is to discover, through
theoretical and empirical investigation, those statistical techniques
best suited to the answer-finding problem.

David Cohn and Huan Chang (2000).
We describe a model of document citation that learns to identify
hubs and authorities in a set of linked documents, such as pages
retrieved from the world wide web, or papers retrieved from a research
paper archive. Unlike the popular HITS algorithm, which relies on
dubious statistical assumptions, our model provides probabilistic
estimates that have clear semantics. We also find that, in general,
the identified authoritative documents correspond more closely to
human intuition than those identified by HITS.

Huan Chang, David Cohn and Andrew McCallum (2000).
The proliferation of hypertext and the popularity of Kleinberg's HITS
algorithm have brought about an increased interest in link analysis.
While HITS and its older relatives from the Bibliometrics literature
provide a method for finding authoritative sources on a particular
topic, they do not allow individual users to inject their own opinions
about what sources are authoritative. This paper presents a learning
technique which incorporates user feedback to adjust its model of
document authority so that it better corresponds to the user's
internal model. We demonstrate how this technique can be used to
generate user-specific authority lists. We present experimental
results based on a database of about one million references collected
as part of the Cora on-line index of the computer science
literature.
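For reference, the core HITS iteration that this feedback technique adjusts can be sketched in a few lines (the four-document link graph below is an illustrative assumption, not data from the paper):

```python
import math

# Toy link graph (an illustrative assumption): links[i] lists the
# documents that document i cites.
links = {0: [1, 2], 1: [2], 2: [], 3: [1, 2]}
n = len(links)

hubs = [1.0] * n
auths = [1.0] * n
for _ in range(50):
    # A document cited by good hubs becomes more authoritative ...
    auths = [sum(hubs[i] for i in range(n) if j in links[i]) for j in range(n)]
    # ... and a document citing good authorities becomes a better hub.
    hubs = [sum(auths[j] for j in links[i]) for i in range(n)]
    a_norm = math.sqrt(sum(a * a for a in auths))
    h_norm = math.sqrt(sum(h * h for h in hubs))
    auths = [a / a_norm for a in auths]
    hubs = [h / h_norm for h in hubs]

print(auths.index(max(auths)))  # → 2: the document every hub cites
```

User feedback enters by reweighting the links or scores in this iteration so that the fixed point better matches the user's own ranking.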

Greg Schohn and
David Cohn (2000).
We describe a simple active learning heuristic which greatly enhances
the generalization behavior of support vector machines (SVMs) on
several practical document classification tasks. We observe a number
of benefits, the most surprising of which is that an SVM trained on a
well-chosen subset of the available corpus frequently performs better
than one trained on all available data. The heuristic for
choosing this subset is simple to compute, and makes no use of
information about the test set. Given that the training time of SVMs
depends heavily on the training set size, our heuristic not only
offers better performance with fewer data, it frequently does so in
less time than the naive approach of training on all available data.
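The shape of such a heuristic, querying the unlabeled example closest to the current hyperplane, can be sketched as below; the synthetic data, the tiny Pegasos-style trainer standing in for a real SVM solver, and all hyper-parameters are illustrative assumptions, not the paper's setup:

```python
import random

random.seed(0)

# Toy linearly separable corpus: the label is the sign of x0 + x1.
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
labels = [1 if x + y > 0 else -1 for (x, y) in data]

def train_linear_svm(idx, epochs=200, lam=0.01):
    """Tiny Pegasos-style sub-gradient trainer for a linear SVM (no bias)."""
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)
            (x0, x1), y = data[i], labels[i]
            margin = y * (w[0] * x0 + w[1] * x1)
            w = [wj * (1.0 - eta * lam) for wj in w]
            if margin < 1.0:
                w[0] += eta * y * x0
                w[1] += eta * y * x1
    return w

labeled = list(range(4))                  # a small seed set of labeled examples
pool = list(range(4, 200))
for _ in range(10):                       # the active-learning loop
    w = train_linear_svm(labeled)
    # Query the pool example closest to the current separating hyperplane.
    nxt = min(pool, key=lambda i: abs(w[0] * data[i][0] + w[1] * data[i][1]))
    labeled.append(nxt)
    pool.remove(nxt)

w = train_linear_svm(labeled)
acc = sum((w[0] * x0 + w[1] * x1 > 0) == (y > 0)
          for (x0, x1), y in zip(data, labels)) / len(data)
print(f"accuracy after {len(labeled)} labels: {acc:.2f}")
```

The selection criterion needs only the margin of each candidate under the current model, which is why it is cheap to compute and uses no test-set information.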

Brigham Anderson, Andrew Moore and David Cohn (2000).
This paper describes pairwise bisection, a nonparametric approach to
optimizing a noisy function with few function evaluations. The
algorithm uses nonparametric reasoning about simple geometric
relationships to find minima efficiently. Two factors often frustrate
optimization: noise and cost. Output can contain significant
quantities of noise or error, while time or money allow for only a
handful of experiments. Pairwise bisection is used here to attempt to
automate the process of robust and efficient experiment design. Real
world functions also tend to violate traditional assumptions of
continuity and Gaussian noise. Since nonparametric statistics do
not depend on these assumptions, this algorithm can optimize a wide
variety of phenomena with fewer restrictions placed on noise. The
algorithm's performance is compared to that of three competing
algorithms, Amoeba, PMAX, and Q2, on several different test
functions. Results on these functions indicate competitive performance
and superior resistance to noise.

Satinder Singh and David Cohn
(1998).
We are frequently called upon to perform multiple tasks that
compete for our attention and resources. Often we know the optimal
solution to each task in isolation; in this paper, we describe how
this knowledge can be exploited to efficiently find good solutions for
doing the tasks in parallel. We formulate this problem as that of
dynamically merging multiple Markov decision processes (MDPs) into a
composite MDP, and present a new theoretically-sound dynamic
programming algorithm for finding an optimal policy for the composite
MDP. We analyze various aspects of our algorithm and illustrate its
use on a simple merging problem.
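The composite-MDP formulation (only the formulation; the paper's algorithm exploits the individual tasks' known solutions rather than solving the composite from scratch) can be illustrated with two toy chain tasks, where each composite action advances one task at a time; the chains, rewards, and discount below are illustrative assumptions:

```python
# Two toy tasks, each a 3-state chain; a task is "done" in its goal state.
# Composite state = (state of task A, state of task B); each composite
# action advances exactly one task.
N, GOAL, gamma = 3, 2, 0.95

V = {(a, b): 0.0 for a in range(N) for b in range(N)}
for _ in range(100):                        # value iteration on the composite MDP
    for (a, b) in V:
        if a == GOAL and b == GOAL:
            continue                        # terminal: both tasks done
        vals = []
        for act in (0, 1):                  # 0: advance task A, 1: advance task B
            na = min(a + 1, GOAL) if act == 0 else a
            nb = min(b + 1, GOAL) if act == 1 else b
            r = 1.0 if (na == GOAL and nb == GOAL) else -0.01
            vals.append(r + gamma * V[(na, nb)])
        V[(a, b)] = max(vals)

print(round(V[(0, 0)], 3))  # → 0.829: four advances, small per-step costs
```

Even in this tiny case the composite state space is the product of the tasks' spaces, which is why a dynamic programming method that reuses the single-task solutions is attractive.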

David Cohn (1997).
I describe a querying criterion that attempts to minimize the error of
a learner by minimizing its estimated squared bias. I describe
experiments with locally-weighted regression on two simple problems,
and observe that this `bias-only' approach outperforms the more
common `variance-only' exploration approach, even in the presence of
noise.

David Cohn and Satinder Singh (1997).
Predictions of lifetimes of dynamically allocated objects can be,
and have been, used to improve the time and space efficiency of dynamic
memory management in computer programs. Barrett and Zorn (1993) used a
simple lifetime predictor and demonstrated this improvement on a
variety of computer programs. In this paper, we use decision trees to
do lifetime prediction on the same programs and show significantly
better prediction. Our method also has the advantage that during
training we can use a large number of features and let the decision
tree automatically choose the relevant subset.

David Cohn, Zoubin
Ghahramani, and Michael Jordan (1996).
For many types of machine learning algorithms, one can compute the
statistically `optimal' way to select training data. In this paper, we
review how optimal data selection techniques have been used with
feedforward neural networks. We then show how the same principles may
be used to select data for two alternative, statistically-based
learning architectures: mixtures of Gaussians and locally weighted
regression. While the techniques for neural networks are
computationally expensive and approximate, the techniques for mixtures
of Gaussians and locally weighted regression are both efficient and
accurate. Empirically, we observe that the optimality criterion
sharply decreases the number of training examples the learner needs in
order to achieve good performance.
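As a concrete, much simplified instance of the criterion: for linear regression through the origin, y = w*x + noise (a simplification of the paper's mixture and locally weighted settings), var(w_hat) = sigma^2 / sum(x_i^2), so the variance-minimizing query is simply the candidate input of largest magnitude. A sketch, with made-up candidate inputs:

```python
# Greedy variance-minimizing data selection for y = w*x + noise:
# var(w_hat) = sigma^2 / sum(x_i^2), so each query should maximize
# the increase in sum(x_i^2), i.e. pick the largest-magnitude input.
candidates = [0.1, -0.4, 0.9, 0.3, -0.7]
queried = []
for _ in range(3):
    best = max(candidates, key=lambda x: x * x)
    queried.append(best)
    candidates.remove(best)

print(queried)  # → [0.9, -0.7, -0.4]
```

The paper's contribution is that for mixtures of Gaussians and locally weighted regression the analogous expected-variance computation is likewise closed-form, so selection stays efficient and exact.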

David Cohn, Les Atlas, and
Richard Ladner (1994).
Active learning differs from passive "learning from examples" in
that the learning algorithm assumes at least some control over what
part of the input domain it receives information about. In some
situations, active learning is provably more powerful than learning
from examples alone, giving better generalization for a fixed number
of training examples.
In this paper, we consider the problem of learning a binary concept in
the absence of noise (Valiant 1984). We describe a formalism for
active concept learning called *selective sampling*, and show how
it may be approximately implemented by a neural network. In selective
sampling, a learner receives distribution information from the
environment and queries an oracle on parts of the domain it considers
"useful." We test our implementation, called an *SG-network*, on
three domains, and observe significant improvement in generalization.
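The region-of-uncertainty idea that selective sampling approximates can be seen in its simplest form on a 1-D threshold concept (this toy sketch is not the SG-network itself, and the threshold value is arbitrary): querying only where the learner is still uncertain turns learning into binary search.

```python
# Hidden target concept f(x) = 1 iff x >= theta. The region of
# uncertainty is the interval between the largest known-negative and
# smallest known-positive example seen so far.
theta = 0.7317             # arbitrary illustrative target
lo, hi = 0.0, 1.0          # current region of uncertainty

for _ in range(20):
    q = (lo + hi) / 2      # query the middle of the uncertain region
    if q >= theta:         # oracle labels the query
        hi = q             # positive: the threshold is at or below q
    else:
        lo = q             # negative: the threshold is above q

print(hi - lo)  # 2**-20: uncertainty shrinks exponentially in the query count
```

Passive sampling would need on the order of 2**20 random examples to pin the threshold down this tightly, which is the sense in which selective sampling can be provably more powerful.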

David Cohn, Eve Riskin, and Richard Ladner (1994).
We examine how the performance of a memoryless vector quantizer
changes as a function of its training set size. Specifically, we study
how well the training set distortion predicts test distortion when the
training set is a randomly drawn subset of blocks from the test or
training image(s). Using the Vapnik-Chervonenkis dimension, we derive
formal bounds for the difference of test and training distortion of
vector quantizer codebooks. We then describe extensive empirical
simulations that test these bounds for a variety of bit rates and
vector dimensions, and give practical suggestions for determining the
training set size necessary to achieve good generalization from a
codebook. We conclude that, by using training sets consisting of only
a small fraction of the available data, one can produce results that
are close to the results obtainable when all available data are used.
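A toy version of the experiment, with synthetic 2-D "blocks" and a Lloyd-style codebook trainer standing in for the paper's vector quantizer (the data, cluster layout, and convenient codebook initialization are all simplifying assumptions), shows distortion from a 10% training subset tracking distortion on the full data:

```python
import random

random.seed(1)

# Synthetic stand-in for image blocks: 2-D points from three clusters.
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
blocks = [(cx + random.gauss(0, 0.5), cy + random.gauss(0, 0.5))
          for cx, cy in centers for _ in range(300)]

def train_codebook(points, iters=20):
    """Lloyd iteration for a 3-codeword codebook (init near the true
    centers is a simplification to keep the sketch deterministic)."""
    code = list(centers)
    for _ in range(iters):
        cells = [[] for _ in code]
        for p in points:
            j = min(range(len(code)), key=lambda c:
                    (p[0] - code[c][0]) ** 2 + (p[1] - code[c][1]) ** 2)
            cells[j].append(p)
        code = [(sum(x for x, _ in cell) / len(cell),
                 sum(y for _, y in cell) / len(cell)) if cell else code[j]
                for j, cell in enumerate(cells)]
    return code

def distortion(points, code):
    return sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in code)
               for p in points) / len(points)

subset = random.sample(blocks, 90)        # train on 10% of the blocks
sub_code = train_codebook(subset)

train_d = distortion(subset, sub_code)    # training distortion
test_d = distortion(blocks, sub_code)     # distortion on all the data
print(train_d, test_d)
```

The two distortions typically come out close, which is the small-training-set behavior the VC-dimension bounds formalize.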

David Cohn
(1994).
Consider the problem of learning input/output mappings through
exploration, e.g. learning the kinematics or dynamics of a robotic
manipulator. If actions are expensive and computation is cheap, then
we should explore by selecting a trajectory through the input space
which gives us the most amount of information in the fewest number of
steps. I discuss how results from the field of optimal experiment
design may be used to guide such exploration, and demonstrate its use
on a simple kinematics problem.

David Cohn
(1996). (expanded version of the above paper)
I consider the question "How should one act when the only goal is to
learn as much as possible?" Building on the theoretical results of
Fedorov and MacKay, I apply techniques from Optimal Experiment Design
(OED) to guide the query/action selection of a neural network learner.
I demonstrate that these techniques allow the learner to minimize its
generalization error by exploring its domain efficiently and
completely. I conclude that, while not a panacea, OED-based
query/action has much to offer, especially in domains where its high
computational costs can be tolerated.

---------------------------------------------