William W. Cohen
William Cohen received his bachelor's degree in Computer Science from
Duke University in 1984, and a PhD
in Computer Science from Rutgers
University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from
April 2000 to May 2002 Dr. Cohen worked at Whizbang Labs, a company
specializing in extracting information from the web. Dr. Cohen is
President of the International Machine Learning
Society, an Action Editor for the Journal of Machine Learning
Research, and an Action Editor for the journal ACM Transactions on Knowledge
Discovery from Data. He is also an editor, with Ron Brachman,
of the AI
and Machine Learning series of books published by Morgan & Claypool. In the
past he has also served as an action editor for the journal Machine
Learning, the journal Artificial
Intelligence, and the Journal
of Artificial Intelligence Research. He was General Chair for
the 2008 International
Machine Learning Conference, held July 6-9 at the University of Helsinki;
Program Co-Chair of the 2006 International
Machine Learning Conference; and Co-Chair of the 1994
International Machine Learning Conference. Dr. Cohen was also the
co-Chair for the 3rd
Int'l AAAI Conference on Weblogs and Social Media, which was held
May 17-20, 2009 in San Jose, and was the co-Program Chair for the 4th Int'l AAAI Conference
on Weblogs and Social Media, which will be held May 23-26 at
George Washington University in Washington, D.C. He is an AAAI Fellow,
and in 2008, he won the SIGMOD
"Test of Time" Award for the most influential SIGMOD paper of 1998.
Dr. Cohen's research interests include information integration and
machine learning, particularly information extraction, text
categorization and learning from large datasets. He holds seven
patents related to learning, discovery, information retrieval, and
data integration, and is the author of more than 180 publications.
Projects I'm currently (or recently) involved with include:
- Querendipity, an adaptive personal
information management system for biologists.
- SEAL, a Google-Sets-like bootstrapping tool written by my former student Richard Wang.
- Read the Web, a large-scale web extraction system (which uses SEAL, as well as other techniques).
- SimStudent, a project that adds learning-by-demonstration to CTAT.
- SLIF, a system that analyzes the text and images
in online journal articles to find information about the subcellular localization of proteins.
- Minorthird, an open-source Java package of information extraction software. (Note: we've
migrated the code now from SourceForge to GitHub.)
Measure twice, cut once - Vitor and Ramnath have developed a Thunderbird
plugin that implements recipient
recommendation and leak
detection for email. It modifies Thunderbird by adding an
additional pane that pops up after you send a message, giving you one
final chance to fix any errors in your recipient list. There's a
brief writeup on how to use it, but it's
pretty self-explanatory: just download it, open Thunderbird, and go to
the tools->addon menu to install. After you've installed it, you
train by opening your folder of "Sent" mail and pressing the "train"
button. (This took about an hour for my 9000+ old messages.)
Ramesh Nallapati has put together two nice demos of his multiscale topic tomography topic-modeling technique, one
for articles from Science,
and one with cancer-related
articles from PubMed.
Here are two movies that demo SimStudent, a programming-by-demonstration
system for constructing cognitive tutors, built by Noboru Matsuda.
The following datasets are available for anyone to use for research
- Code for Ni
Lao's PRA method (described in our ECML paper) is
Frank Lin's home page now contains
- the code
for power iteration clustering (the algorithm described in our
ICML-2010 paper) as well as
the datasets we used in the experiments.
- the code
for MultiRandomWalk (the semi-supervised learning algorithm described in our
ASONAM-2010 paper) as well as
the datasets we used in those experiments.
- Minorthird is an
open-source Java package of information extraction and text
classification learning tools.
I am now distributing a standalone tool, built on Minorthird, for
annotating biomedical text. This is particularly aimed at annotating
figure captions but might be useful for other text as well. The jar file for this is rather large
(17M), as it includes a Minorthird jar. There is documentation available for this,
and some sample data.
My former student Vitor Carvalho distributes the poetically named Jangada and
which are also standalone apps built on top of Minorthird, to analyze
another open-source Java package, of approximate string matching
- SLIPPER is an old old
rule-learning system Yoram Singer and I developed. This code is
provided with absolutely no warranty, promise of support, or really,
any expectation that it will keep working. You are totally on your
own with this one, friend.
- WHIRL is another old system I wrote. Currently, I am not
distributing it, but ask me if you're interested in reviving the
- To get a copy of RIPPER, please send mail to my evil twin brother,
wcohen -AT- gmail.com.
As an alternative to that ancient code: I haven't used it myself, but
I've heard good things about
J-RIP, a Ripper clone written for WEKA.
- Data sets for my paper
"Crowdsourced Comprehension: Predicting Prerequisite Structure in
Wikipedia" with Partha Talukdar from BEA-2012.
of HTML Tables, hyponyms, as well as extracted entity clusters and MLT
evaluations, all associated with
on WebSets from WSDM-2012.
- The network
datasets used in the experiments of our ICML-2010 paper
are on Frank Lin's home page.
100,000+ bibliography entries, in the original BibTeX format, converted to an EndNote-like format, and in a featurized format, for experiments with matching (60M).
A 56k-node, 200k-edge graph containing data from SGD and PubMed, used in Querendipity.
messages from 20 Newsgroups, annotated for reply bodies and
signatures, prepared by my former student Vitor Carvalho
Two subsets of the Enron data, annotated with person names,
prepared by my student Einat
- Enron email dataset
(400Mb, once you get there) contains 800,000+ emails from 150+ users
organized into 4700+ folders.
- Some more email data: about two
thousand messages released to the public as part of the ongoing investigation
of US Attorney firings at the Dept of Justice. This is very
strange data---the original email is released as scanned printouts in
PDF (?!), so most of the text is not available. There are links to
copies of the PDF, some manually added annotations, and an (apparently
manually reconstructed) social network graph. About 1.5Mb (in Excel
format). From Mark
Johnson, and a network of volunteers.
- A collection of various extraction datasets
in Minorthird format (6Mb), including about 1000 Enron emails tagged
for person names and temporal expressions.
- classify.tar.gz (0.4Mb) contains
nine problems in which the goal is to classify short entity names.
This data was used in Joins that Generalize: Text Classification
Using WHIRL (KDD-98).
- ranking.tar.gz (8Mb) contains the
data used for the meta-search experiments in my JAIR paper Learning to Order
Things (with Rob Schapire and Yoram Singer).
- match.tar.gz (0.7Mb) contains a suite of
labeled entity-name matching and clustering problems
(i.e. problems for which the correct matches/clusters are provided),
in a single consistent format. In most cases WHIRL's performance is
given as a benchmark. (These are also distributed in the RIDDLE
Repository. Extraction-oriented versions of some of this data,
represented as a problem of extracting data from a website rather than
matching two datasets, are available on the RISE Repository.)
- whirl-bench.tgz (1.1Mb) contains some
more WHIRL-format entity name matching problems.
- Reasoning With Data Extracted from The Biomedical Literature,
invited talk at a joint session of the AAAI Fall Symposia on Discovery Informatics, and
Information Retrieval and Knowledge Discovery in Biomedical Text.
- Learning Similarity Relations Based on Random Walks in Graphs,
invited talk at CIKM 2012, October, 2012.
- Fast Effective Clustering for Graphs and Documents, given at CMU's LTI Colloquium Feb 10, 2012.
- Learning to Extract a Broad-Coverage
Knowledge Base from the Web, invited talk at the Symposium on
Data-Intensive Analysis, Analytics, and Informatics, Pittsburgh, PA Apr 2011.
- Open Information Extraction Methods:
Computers that Learn to Read, invited talk at National Federation
of Advanced Information Services (NFAIS), Philadelphia, PA, Feb 2011.
- Learning Proximity Relations Defined by
Linear Combinations of Constrained Random Walks, given at a
seminar at the University of Maryland in Sep 2010.
- Modeling Entity-Entity Links
and Entity-Annotated Text, given at the ICML 2010 Workshop on
- Predictively Modeling Social Media,
invited talk given at
the 1st International Workshop on Mining Social Media, co-located with the 13th Conference of the Spanish Association for Artificial Intelligence (CAEPIA-TTIA 2009).
- Matching and clustering product descriptions
using learned similarity metrics, invited talk given at
the IJCAI 2009 Workshop on Information Integration on the Web, July 2009. (Powerpoint; 6.7M)
- Open information extraction talks:
- Embodied Cognition and Knowledge:
Integration of Heterogeneous Databases without Common Domains Using
Queries Based on Textual Similarity, talk given for my 10-year
"Test of Time" Award at SIGMOD-2008. (Powerpoint; 11Mb)
- Using Machine Learning to Discover
and Understand Structured Data, invited talk given at LinkedData
2008. (Powerpoint; 6Mb)
- Machine Learning for Personal Information
Management, invited talk given at ICMLA-2007. (Powerpoint; 8Mb)
- A Framework for Learning to Query Heterogeneous Data,
invited talk given at IQIS 2006. (Powerpoint; 8Mb)
- On Beyond Hypertext: Searching in Graphs
Containing Documents, Words, and Actual Data, invited talk given
at DB/IR Day 2006. (Powerpoint; 6Mb)
- A Century Of Progress On Information
Integration: A Mid-Term Report, an overview of information
integration, focusing modestly on my own work, given as invited
talk at WebDB-2005. (Powerpoint;
- Information extraction (PowerPoint;
4.8Mb), aimed at folks somewhat familiar with statistical NLP
methods. And thanks to Thierry Poibeau, there's also a version en francais (did I get that right, Thierry?)
Two earlier versions of this are also still around, both
given with Andrew McCallum at recent conferences, KDD-2003 (PowerPoint; 6.8Mb) and NIPS-2002.
- Text classification
(PowerPoint; 3Mb), given at a CALD Summer Course.
filtering (PowerPoint; 9.1Mb), given at a DIMACS workshop.
- A mini-course on record linkage and matching:
- Other technical talks:
- Fall 2012: ML 10-802 and LTI 11-772 (Analysis of Social Media), 10:30-11:50pm Tues & Thurs, 4303 Gates Building.
- Fall 2012: 10-915, the MLD Journal Club, 12-1:20pm Tue & Thu, 4101 Gates Building (with Roy Maxion).
- Spring 2012: Machine Learning with Large Datasets, Tues-Thurs 1:30-2:50pm, NSH 1305
- Fall 2011: Structured
Prediction for Language and Other Discrete Data (SPLODD-2011), ML
10-710 and LTI 11-763, Tues-Thursday 3:00-4:20 in Gates-Hillman 4211.
This is co-taught by Noah Smith and me, and will include some
subjects from Information
Extraction and some from Language and Stats 2. A
machine learning course (10-701 or consent of the instructors) is a
prereq; we don't recommend that you take the course if you have
already taken Information Extraction or Language and Stats 2.
- Spring 2011: ML 10-802 and LTI 11-772 (Analysis of Social Media), 10:30-11:50pm Tues & Thurs, 4303 Gates Building.
- Spring 2011: 10-915, the MLD Journal Club, 3-4pm Mon & Wed, 4101 Gates Building.
- Fall 2010: 10-707
(Information Extraction - cross-listed in LTI as 11-748),
1:30-2:50pm Mon & Wed, Gates 4101. The first class is 9/8, the
Wed after Labor Day, to allow incoming students time to attend the IC
- Spring 2010: 10-802 (Analysis of Social Media).
- Fall 2009: 10-707
(Information Extraction), 1:30-2:50pm Mon & Wed, 5222 Gates
- Spring 2008: 10-601 (Machine Learning)
with Tom Mitchell, 3-4:30
Mon & Wed in Wean Hall 5409.
- Fall 2007: Analysis of Social
Media, Machine Learning 10-802 and LTI 11-772, with Natalie Glance
(of Google Pittsburgh) - a brand-new seminar course. 4:30-6:30
Tuesdays in Wean Hall 4623.
- Note: This site is the shattered remains of a once-beautiful wiki,
created by the students of 10-802, generously hosted for free by
ScribbleWiki, tragically lost (due to
a combination of RAID drive failures and low-bidder backup schemes),
and then largely recovered
from various internet caches and archives.
- Fall 2007: Current Topics
in Computational Biology (Journal Club), 02-701. (Announcements). Thursdays from 4:00-5:00 in 411
Mellon Institute (after Cell & Systems Modeling).
- Spring 2007: Information Extraction, Machine
Learning 10-707 and LTI 11-748 - back by popular demand for the first time since 2004!
- Fall 2006: Current Topics in Computational Biology (Journal Club), 02-701.
- Spring 2006: Read the Web, CALD 10-709.
- June 21,23,25, 2005: A mini-course on Minorthird. Materials are below.
- Slides, notes, and sample files from first
- Slides, notes, and sample files from second
- Powerpoint slides from third
- Jar file for minorThird, if you
only want to run the code, not compile it or read it.
The installation process here is:
- Install Java 1.4 or higher (actually, JRE is all you need).
- Download the jar for minorThird
and stick it in some directory.
- Optionally, download the sample data
repository and unpack it into the same directory.
- Change to that same directory and
then run Minorthird with the command
java -Xmx500M -jar minorthird.jar
What will pop up will be a small launch pad that can be used to
start any of the UI programs. You can also start a particular
main by specifying minorthird.jar as your classpath, for example:
java -Xmx500M -cp minorthird.jar edu.cmu.minorthird.ui.Help
- If you want to do a real install, here's the home page on SourceForge, and
a document on how to do a CVS
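The two launch commands in the installation steps above differ only in whether the jar is run directly or used as a classpath. Here is a minimal Python sketch of that difference. The helper name `minorthird_command` and its parameters are hypothetical, made up for illustration. Only the `java` flags, the jar name, and the class name `edu.cmu.minorthird.ui.Help` come from the steps above.

```python
def minorthird_command(jar="minorthird.jar", main_class=None, heap="500M"):
    """Build the argument list for launching Minorthird.

    Hypothetical helper for illustration; it is not part of Minorthird.
    """
    cmd = ["java", f"-Xmx{heap}"]
    if main_class is None:
        # No main class given: run the jar itself, which opens the launch pad.
        cmd += ["-jar", jar]
    else:
        # Put the jar on the classpath and start one particular main.
        cmd += ["-cp", jar, main_class]
    return cmd


if __name__ == "__main__":
    print(" ".join(minorthird_command()))
    print(" ".join(minorthird_command(main_class="edu.cmu.minorthird.ui.Help")))
```

Passing either list to `subprocess.run` from the directory that holds the jar should start the program, assuming Java is installed and on the path.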
- Spring 2004: "Learning to Turn Words into Data:
Machine Learning Approaches to Information Extraction and Information Integration", CALD 10-707 and LTI 11-748.
Recent papers I'm keeping in HTML or PDF (which requires Adobe
Acrobat Reader to view). Older papers are mostly in Postscript.
For Windows, I use the GSView reader for
postscript. Most of these papers are viewable in several formats in
- Here's an RSS feed of my papers. (Note: the feed I had created with Dapper seems spam-infested now.)
Here's a pointer to my DBLP page.
- A Computer Scientist's Guide To Biology is no longer
available from this web page, but is now available from Springer. Here is the TOC,
introduction, index, and a sample chapter, from a late draft of
the book; and also all the figures
from the book in PowerPoint and all the figures in
PDF. (The figures are a little prettier than the ones in the
final book, which is black and white, not color).
2006 Proceedings are available in print, for the true aficionado
of fine learning-related research. It's well worth the money for the
cover art alone (of course, all the papers are also available on-line
- Recent and selected publications. These
are some representative publications for which on-line copies can be
- All publications. Here is a more-or-less
complete chronological list of my publications. The bibliography
includes pointers to on-line versions when I can provide them, but
unfortunately copyright restrictions don't allow me to make all of my
publications available on-line. Of course, reprints are always
available from me on request.
- Publications by topic:
Ahn Hoang, visiting from Singapore Management University for
the 2012-2013 academic year.
- Ramnath Balasubramanyan, LTI PhD student
- Dana Movshovitz-Attias, CSD PhD student.
- William Yang Wang, LTI PhD student
- Bhavana Dalvi Mishra, LTI PhD student
(co-advised with Jamie Callan)
- Mahesh Joshi, LTI PhD student
(co-advised with Carolyn Rosé)
- Nan Li, CS PhD student
(co-advised with Ken Koedinger)
- Nozomi Nori, LTI PhD student, co-advised with Christos Faloutsos
- Tae Yano, LTI PhD student
(co-advised with Noah Smith)
- Malcolm Greaves, CSD undergraduate senior
- Katie Rivard Mazaitis, research programmer/analyst
- Frank Lin (former LTI PhD student, now at Twitter)
- Ni Lao (former LTI PhD student, now at Google)
- Richard C. Wang,
(former LTI PhD student co-advised with Bob Frederking, now at Enfind).
- Andrew Arnold
(former MLD PhD student, now at TrexQuant)
- Noboru Matsuda
(former postdoc, co-supervised with Ken Koedinger,
now System Scientist in CMU's HCII)
- Einat Minkov
(former LTI PhD student, now at Haifa University)
- Vitor Rocha de Carvalho (former LTI PhD student, now at Qualcomm)
- Zhenzhen Kou (former MLD PhD student, now at Yahoo!)
- Ja-Hui Chang
(visiting faculty from National Central University, Taiwan, 2007-2008)
Chong Tat Chua (PhD student at Singapore Management University,
visited CMU for the academic year 2011-2012 in my group.)
- Gustavo Lacerda
(former research assistant, co-supervised with Noboru Matsuda and Ken Koedinger, now at UBC)
- Ramesh Nallapati
(former postdoc, co-supervised with John Lafferty, now at IBM Watson)
- Edoardo Airoldi
(former MLD/Stats PhD student, co-advised with Steve Fienberg)
- Pradeep Ravikumar
(former MLD PhD student, co-advised with Steve Fienberg)
- I have been an external committee member for the PhD theses of
I have also been an external committee member for the Master's theses of
Mehrbod Sharifi (CMU) and
Weam Abu-Zaki (CMU).
I am currently an external committee member for Justin Betteridge,
Qirong Ho, Shreejoy Tripathy, YiChia Wang (all at CMU); and Freddy
Chong Tat Chua (at Singapore Management University).
- I also have collaborated recently and frequently with Tom Mitchell, Bob Murphy,
and Anthony Tomasic.
Machine Learning Department
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213
8217 Gates Hillman Complex
(shipping address: 6105 Gates Hillman Complex)
voice: 412-268-7664 / fax: 412-268-2205
Assistant: Sharon Cavlovich, email@example.com, 412-268-5196
Official CMU Contact Info
My preferred email address is: wcohen AT cs DOT cmu DOT edu
For those many friends whose research I have built on, be warned.
My full name, "William Weston Cohen", is an anagram of the phrase "I
now cite shallow men". (From Sara Cohen - no
relation! - comes this warning: "Women's rights activists would
probably request you to use the following anagram instead: 'I shall
now cite women'".)
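For the skeptical, both anagrams can be checked mechanically. Here is a small Python sketch that compares letter counts while ignoring case, spaces, and punctuation. The helper names `letter_counts` and `is_anagram` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def letter_counts(text):
    """Count letters only, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def is_anagram(a, b):
    """True when both strings use exactly the same multiset of letters."""
    return letter_counts(a) == letter_counts(b)


if __name__ == "__main__":
    name = "William Weston Cohen"
    for phrase in ("I now cite shallow men", "I shall now cite women"):
        print(phrase, "->", is_anagram(name, phrase))
```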
I am often praised for my highly artistic and functional web site
designs. An example is the site for SC Indexing, a professional book
indexer. However, I accept few clients - this one happens to be
Through my advisor, Alex Borgida, I can trace my "academic lineage" back to luminaries like
Leibniz, Newton and Alfred Whitehead.
When I'm not working my day job, I avoid productive behavior by playing music.