Code and Data




Link Data Sets







NOTE: These data sets were designed for the experiments in "Finding Underlying Connections: A Fast Graph-Based Method for Link Analysis and Collaboration Queries", which also introduces and describes them. We ask that you cite that paper if you use the data sets in your own papers. The bibtex for the paper can be found here.

NOTE: The data sets provided here are in the old link data set format. The GDA and cGraph programs on this site now use a new link data format, so the data sets must be converted before they can be used with those programs.

Files:
  • lab.zip - Co-publication data from the Auton Lab at Carnegie Mellon University.
  • institute.zip - Links of three different types (co-publication, common research interest, and advisor/advisee) that were collected from public data on the Carnegie Mellon University Robotics Institute's webpages.
  • manual.zip - Links created by a human who manually read a set of public web pages and news stories related to terrorism and subjectively linked entities mentioned in the articles.
  • citeseer.zip - Co-publication data from citeseer.com (coming soon - pending permission).
  • imdb.zip - Movie information from www.imdb.com (coming soon - pending permission).
Description: A variety of link data sets.

Past Usage:

Format:

The Names File:
The names file lists each entity in the data set along with any related demographic information. Each line holds the information for a single entity as a comma-separated list. The first row contains the column labels, the first of which must be the "name" column. There are two names files for each data set:
  • dataset_names.txt - a names file for the filtered data set.
  • dataset-dems.txt - a names file for the unfiltered data set.
An example names file might look like:

name
aaa
bbb
ccc
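
As an illustration of the names file layout described above, here is a minimal Python sketch that reads a names file into a dictionary keyed by entity name. The function name `read_names` is hypothetical and is not part of the GDA or cGraph programs. The sketch assumes plain comma-separated text and treats any columns after "name" as demographic fields.

```python
import csv
import io


def read_names(lines):
    """Parse a names file: a header row, then one entity per line.

    Hypothetical helper for illustration; assumes plain comma-separated text.
    Returns the header and a dict mapping each entity name to a dict of its
    remaining (demographic) columns.
    """
    reader = csv.reader(lines)
    header = [label.strip() for label in next(reader)]
    # The format requires the first column label to be "name".
    if header[0] != "name":
        raise ValueError('first column label must be "name"')
    entities = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        entities[row[0]] = dict(zip(header[1:], row[1:]))
    return header, entities


# The example names file shown above, with a single "name" column.
sample_names = io.StringIO("name\naaa\nbbb\nccc\n")
header, entities = read_names(sample_names)
```

For a real file, an open file object (for example `open("dataset_names.txt", newline="")`) can be passed in place of the `StringIO` sample.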


The Links File:
The links file contains the set of links. Each line consists of a single link. The format for the link is:

UNIQUE_LINK_ID,LINK_TYPE,ENTITY1,ENTITY2,...

There are two links files for each data set:
  • dataset_filtered_links.txt - a links file for the filtered data set.
  • dataset-links.txt - a links file for the unfiltered data set.
An example links file might look like:

link01,linktype1,aaa,bbb
link02,linktype2,aaa,bbb,ccc
link03,linktype2,aaa,bbb
link04,linktype1,aaa,ccc
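
The link format above can be read with a few lines of Python. This is a sketch only: `Link` and `read_links` are hypothetical names, and it assumes that every link lists at least one entity after the link type.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type: one link ID, one link type, any number of entities.
Link = namedtuple("Link", ["link_id", "link_type", "entities"])


def read_links(lines):
    """Parse lines of the form UNIQUE_LINK_ID,LINK_TYPE,ENTITY1,ENTITY2,..."""
    links = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        # Assumption: a valid link names at least one entity.
        if len(row) < 3:
            raise ValueError("malformed link line: %r" % (row,))
        links.append(Link(row[0], row[1], tuple(row[2:])))
    return links


# The example links file shown above.
sample_links = io.StringIO(
    "link01,linktype1,aaa,bbb\n"
    "link02,linktype2,aaa,bbb,ccc\n"
    "link03,linktype2,aaa,bbb\n"
    "link04,linktype1,aaa,ccc\n"
)
links = read_links(sample_links)
```

Storing the entities as a tuple keeps a link with three or more participants (such as link02) as a single record rather than expanding it into pairs.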


Additional Notes: n/a
Last Update: 10-22-2003




