CALL FOR PAPERS


ICML 2003 Workshop on
The Continuum from Labeled to Unlabeled Data
In Machine Learning and Data Mining

August 21st, 2003
Washington, D.C.

http://www.accenture.com/techlabs/icmlworkshop2003/

Important Dates:
Papers Due: May 5, 2003
Notification: May 25, 2003
Final Version Due: June 10, 2003
Workshop: August 21, 2003


There is a spectrum of ways to use data in machine learning and data mining. At one end is completely unsupervised learning or clustering, and at the other end is supervised learning, where the target output is known for every instance.

This workshop aims to explore the space between these extremes, with particular attention to a variety of real-world applications and sources of labels. Techniques that have been proposed include learning from unlabeled data with hints; learning from unlabeled and positive-only labeled data; learning from distantly and noisily labeled data; combining labeled and unlabeled data with co-training, EM, and other semi-supervised techniques; and transductive learning, where the test data is added as an additional source of unlabeled data. The possible sources of labels and hints are also broad: systematic hand-labeling, labels acquired through active learning, and hints derived from domain knowledge are among those that may be used.

Papers that address novel types of data, present methods for diagnosing when unlabeled data will help and when it will hinder, or apply techniques across multiple application domains and multiple levels of supervision are particularly encouraged. Papers discussing the acquisition of labels from real-world experts in real-world data mining problems are also encouraged. Data mining practitioners working on real-world problems with large amounts of captured/stored data but a high-cost labeling process are encouraged to submit problem descriptions and possible solutions.

Workshop Format
The workshop will consist of both regular paper presentations and debates.

Regular Papers
Regular papers can be up to 8 pages and may address work in progress.
Papers should be in the format required for ICML submissions. The formatting instructions can be found at
http://www.hpl.hp.com/conferences/icml03/formats/index.html

Problem Descriptions from Machine Learning/Data Mining Practitioners
1-2 page papers describing a problem domain you have encountered or dealt with where training data and/or labels are very expensive or hard to obtain. The paper should present a problem statement, give background on the domain, and list the sources and amounts of available training data. We hope these types of papers will encourage participation from people working on practical applications where unlabeled data can potentially be valuable but is not currently utilized.

Debate Position Papers
2-page position papers on either side of the following topics are solicited.
Accepted papers will be published in the workshop proceedings, and authors will be expected to debate their position. Topics not on this list are also acceptable, if you can coherently argue both sides, or can encourage a colleague to submit the opposing position.
    o Unlabeled data is only useful when there are a large number of redundant features
    o Why doesn't the No Free Lunch Theorem apply when working with unlabeled data?
    o Unlabeled data has to come from the same underlying distribution as the labeled data
    o Can unlabeled data be used in temporal domains?
    o Feature engineering is more important than algorithm design for semi-supervised learning
    o All the interesting problems in semi-supervised learning have been identified
    o Active learning is an interesting *academic* problem
    o Active learning research without user interface design is only solving half the problem
    o Using unlabeled data in data mining is no different from using it in machine learning
    o Massive data sets pose problems when using current semi-supervised algorithms
    o Off-the-shelf data mining software incorporating labeled and unlabeled data is a fantasy
    o Unlabeled data is only useful when the classes are well separated

Submissions should be sent by May 5, 2003 as PDF or PostScript files to Rayid.Ghani@accenture.com

Organizers:
Rayid Ghani
Accenture Technology Labs, 161 N. Clark St, Chicago, IL 60601 +1 312-693-6653
http://www.accenture.com/techlabs/ghani/
rayid.ghani@accenture.com

Rosie Jones
Overture Services, 74 N Pasadena Ave, 3F, Pasadena, CA 91107 +1 626-229-8536
http://www.cs.cmu.edu/~rosie/
rosie.jones@overture.com

Chuck Rosenberg
Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 +1 412-268-8078
http://www.cs.cmu.edu/~chuck
chuck@cs.cmu.edu

Program Committee:
Kristin Bennett, Rensselaer Polytechnic Institute
Mark Craven, University of Wisconsin
Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, UCL
Sally Goldman, Washington University, St. Louis
Tony Jebara, Columbia University
Thorsten Joachims, Cornell University
Stefan Kremer, University of Guelph
Bing Liu, National University of Singapore
Andrew McCallum, University of Massachusetts
Ray Mooney, University of Texas, Austin
Ion Muslea, University of California, Irvine
Kamal Nigam, IntelliSeek
Ellen Riloff, University of Utah
Dale Schuurmans, University of Waterloo
Martin Szummer, Microsoft Research, Cambridge
Sarah Zelikovitz, City University of New York
Tong Zhang, IBM Research, Yorktown Heights