CALL FOR PAPERS
ICML 2003 Workshop on
The Continuum from Labeled to Unlabeled Data
In Machine Learning and Data Mining
August 21st, 2003
Washington D.C.
http://www.accenture.com/techlabs/icmlworkshop2003/
Important Dates:
Papers Due: May 5, 2003
Notification: May 25, 2003
Final Version Due: June 10, 2003
Workshop: August 21, 2003
There is a spectrum of ways to use data in machine learning and data mining. At
one end is completely unsupervised learning or clustering, and at the other
end is supervised learning where the target output is known for every instance.
This workshop aims to explore the space between these extremes, with particular
attention to a variety of real-world applications and sources of labels.
Techniques that have been proposed include learning from unlabeled data with
hints, learning from unlabeled and positive-only labeled data, learning from
distantly and noisily labeled data, combining labeled and unlabeled data with
cotraining, EM and other semi-supervised techniques, and transductive learning,
where the test data is added as an additional source of unlabeled data. The
possible sources of labels and hints are also broad: systematic hand-labeling,
labels acquired through active learning, and hints derived from domain knowledge
are among the approaches that may be used.
Papers addressing novel types of data, methods of diagnosing when unlabeled data
will help and when it will hinder, and applying techniques across multiple
application domains and multiple levels of supervision are particularly
encouraged. Papers discussing the acquisition of labels from real-world experts
in real-world data mining problems are also encouraged. Data mining
practitioners working on real-world problems with large amounts of
captured/stored data but a high-cost labeling process are encouraged to submit
problem descriptions and possible solutions.
Workshop Format
The workshop will consist of both regular paper presentations and debates.
Regular Papers
Regular papers can be up to 8 pages, and may address work in progress.
Papers should be in the format required for ICML submissions. The formatting
instructions can be found at
http://www.hpl.hp.com/conferences/icml03/formats/index.html
Problem Descriptions from Machine Learning/Data Mining Practitioners
1-2 page papers describing a problem domain you have encountered or dealt with
where training data and/or labels are very expensive or hard to obtain. The
paper should present a problem statement, give background on the domain, and list
the sources and amount of available training data. We hope these types of papers
will encourage participation from people working on practical applications where
unlabeled data can potentially be valuable but is not currently utilized.
Debate Position Papers
2-page position papers on either side of the following topics are solicited.
Accepted papers will be published in the workshop proceedings, and authors will
be expected to debate their position. Topics not on this list are also
acceptable, if you can coherently argue both sides, or can encourage a colleague
to submit the opposing position.
o Unlabeled data is only useful when there are a large number
of redundant features
o Why doesn't the No Free Lunch Theorem apply when working
with unlabeled data?
o Unlabeled data has to come from the same underlying
distribution as the labeled data
o Can unlabeled data be used in temporal domains?
o Feature engineering is more important than algorithm design
for semi-supervised learning
o All the interesting problems in semi-supervised learning
have been identified
o Active learning is an interesting *academic* problem
o Active learning research without user interface design is
only solving half the problem
o Using unlabeled data in Data Mining is no different than
using it in Machine Learning
o Massive data sets pose problems when using current
semi-supervised algorithms
o Off-the-shelf data mining software incorporating labeled
and unlabeled data is a fantasy
o Unlabeled data is only useful when the classes are well
separated
Submissions should be sent by May 5, 2003, as PDF or PostScript files to
Rayid.Ghani@accenture.com
Organizers:
Rayid Ghani
Accenture Technology Labs, 161 N. Clark St, Chicago, IL 60601 +1 312-693-6653
http://www.accenture.com/techlabs/ghani/
rayid.ghani@accenture.com
Rosie Jones
Overture Services, 74 N Pasadena Ave, 3F, Pasadena, CA 91107 +1 626-229-8536
http://www.cs.cmu.edu/~rosie/
rosie.jones@overture.com
Chuck Rosenberg
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 +1 412-268-8078
http://www.cs.cmu.edu/~chuck
chuck@cs.cmu.edu
Program Committee:
Kristin Bennett, Rensselaer Polytechnic Institute
Mark Craven, University of Wisconsin
Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, UCL
Sally Goldman, Washington University, St. Louis
Tony Jebara, Columbia University
Thorsten Joachims, Cornell University
Stefan Kremer, University of Guelph
Bing Liu, National University of Singapore
Andrew McCallum, University of Massachusetts
Ray Mooney, University of Texas, Austin
Ion Muslea, University of California, Irvine
Kamal Nigam, IntelliSeek
Ellen Riloff, University of Utah
Dale Schuurmans, University of Waterloo
Martin Szummer, Microsoft Research, Cambridge
Sarah Zelikovitz, City University of New York
Tong Zhang, IBM Research, Yorktown Heights