The Problem: Many different people have the same name. In a normal setting, people often use background knowledge to disambiguate different people from a common name. But in the information age today, a database can contain thousands of common names. This will make obtaining background information of every common name unfeasible. However, more information can be obtained from the context surrounding the common name. Suppose the information about relationships between each name is given, partitions can be formed from among the associates of the common name. The assumption is that if different people uses a common name, there will be distinct partitions among the associates. For example, the name Michael Jordan is used by many different people. Some will know Michael as a basketball star, and some will know it as a statistic professor. Suppose the background information (basketball star, statistic professor) is not given, disambiguation still can be done by knowing about the associates of the name. Michael Jordan, as a basketball star, will be closely associated with Scottie Pippen and Washington Wizards and other basketball entities. But Michael Jordan, as a statistic professor, will be closely associated with names of graduate students and U.C. Berkeley. Just by observing the associates, there will be a clear partition between the associates of the basketball star versus the associates of the professor. Approach: In all the approaches, the input is a symmetric matrix with each row or each column representing an associate of the name in question. And the number inside each cell of the matrix represents how frequent the two associates occur together in a given context. For example, if the dataset is newspaper articles, row i can be George W. Bush, column j can be Saddam Hussein, and cell(i,j) is how many times these two associates appear together in newspaper articles. Note that both Bush and Saddam can appear without the name in question. This matrix can also be represented as a weighted graph with the vertices being associates and the weighted edges being the relationships between any two associates. The first approach is to select two associates that will split all the associates into two clusters, with each selected associate in a separate cluster. And from these two selection, probabilities that each associates belongs to each cluster can be calculated iteratively. The second approach also starts with the selection of the two associates. However, the other associates will get assigned to each cluster base on their distance to the two selected associates. Distance is defined as the length of the shortest path to the selected associate in the graph. The third approach is spectral clustering. First the matrix can be scaled down to two dimension by projecting it onto the first two eigen vectors. Then all the points (associates) gets normalized onto a unit sphere. Finally, Kmeans is used to cluster the projected and normalized points into two clusters. Testing: -#associates These three approaches have been tested on datasets from movie databases, news articles, research papers, cocktail mixes, and CMU Robotics Institute web pages. To simulate two different individuals having the common name, two random individuals are selected and merged into one individual. Then all the relationships of these two individuals are collected. Each relationship contains the individual and all its associates. From these relationships, all the associates of these two are extracted. For example, relationships can be new articles and the two individuals can be both Michael Jordans. All the associates are Scottie Pippen, U.C. Berkeley, and so on. As described above, each approach will assign clusters for each associate. Now a relationship that contained the name in question can also be assigned a cluster by the associates contained in the relationship. So all the relationships can be partition into two clusters. Now the merge of two individuals can be undone to see how well the clustering performed. Back to our example, if the clustering was complete successful, all the news articles that contain Michael Jordan as a basketball star will be in one cluster. And all the news articles that contain Michael as a professor will be in another cluster. If the clustering was complete random, each cluster will contain some basketball articles and some statistics articles. Scoring is done by testing how "pure" are each clusters. 50% means complete random and 100% means perfect clustering. Note that because the way the test was setup, scores cannot go below 50%. Max associates means only the top associates were chosen for that test. Each test score is the average of one hundred runs. 10 100 1000 10k Dan actors-40k .58 .61 .78 .78 news .50 .54 .67 .69 citseer .61 .74 .75 .75 drinks .54 .52 .52 .52 ri .54 .73 .74 .74 Shortdist actors-40k .50 .54 .78 .78 news .50 .55 .59 .59 citseer .58 .72 .72 .72 drinks .55 .63 .62 .62 ri .51 .64 .66 .66 Spectral actors .51 .55 .58 .60 news .51 .54 .59 DNE citeseer .55 .64 .64 .64 drinks .55 .65 .65 .65 ri .54 .67 .65 .65