Frankowski et al. SIGIR 2006
From ScribbleWiki: Analysis of Social Media
You Are What You Say: Privacy Risks of Public Mentions
Authors: Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen and John Ried
This paper explores the re-identification problem of social media pages. The authors claimed that people can be identified by their preferences, such as the ranking for movies, wishing list at Amazon, or even what they talk about at their blog or reviews to a movie. Typically, re-identification is performed by linking two datasets with some common attributes of the two datasets. One of the datasets is anonymous, and the other is identified. By matching the common attributes, people can identify the anonymous dataset. This paper propose to represent the two dataset in a so called sparse relation space, and re-identification is done by matching the common items in the spaces.
1. Risk of dataset release. What are the risks to user privacy when releasing a dataset?
2. Altering the dataset. How can dataset owners alter the dataset they release to preserve user privacy?
3. Self defense. How can users protect their own privacy?
Sparse Relation Space
A sparse relation space is a space that i) relates people to items, ii) is sparse, haveing relatively few relationships recorded per person, iii) involves a relatively large space of items. Examples of sparse relation spaces include custormer purchase data from Target stores, music played on an online music player like iTunes, articles edited in Wikipedia. The authors claimed that any dataset that can be represented as a sparse relation space is volnurable for privacy issues. Privacy loss may occur when we have an anonymous sparse relation space and an identified sparse relation space.
Datasets are collected from MovieLens recommender: as set of movie ratings and a set of movie mentions. At MovieLens, the ratings are not allowed to be identified with users. The rating dataset is released by MovieLens for research use with anonymizing the identities. The movie mention dataset is public on the web set, and identified. Figure 2 shows a snapshot of Movielens.
Characteristic of the dataset
Both ratings and mention dataset follows a power low. Figure 3 shows the number of ratings for items. From the diagram, we can see that there are many rarely rated items, which are important in the re-identification algorithms.
k-identified: the user who is in the top k candidate identities returned from the re-identification algorithm given the user.
k-identification rate: a measure of how well an algorithm can narrow each user in a dataset to one of k users in another dataset. It is computed as the fraction of k-identified users.
Algorithms for re-identification
Both of the two datasets can be represented with sparse relation spaces. The items for the rating dataset are the movies rated by a user, and the items for the mention dataset are the movies mentioned by a user. Several re-identification algorithms are proposed.
1. Set Intersection Algorithm. The users in the ratings dataset who are related to every item in the mentioned by the target user are extracted as the candidate identities. One problem is that user may mention items that they did not rate.
2. TF-IDF Algorithm. Use TF-IDF to measure the sparse relation spaces. Each user of the rating dataset has a vector of rated movies, and each user of the mention dataset has a vector of mentioned movies. Cosine similarity is used to calculate the similarity between two vectors. The most similary users are candidate users for the target user.
3. Scoring Algorithm. Since the rarely rated movies are very important, the score emphasizes mentions of rarely rated movies, and de-emphasizes the number of ratings a user has. If a user has rated a mention, the score for the mention is the fraction of users who have not rated the mention, if the user did not rate the mention, the score is 0.05. The score for a user is the multiplication of the scores of the mentions. Users with top k scores are candidates.
4. Scoring Algorithm with Ratings. Algorithms 1-3 did not use the value of the ratings. Suppose we have a magic analyzer, which can analyze the rating a user has given from the mention of a movie of that user. The scoring algorithm is modified as only when the estimated rating from a mention is the same as the rating in the item, the score for the mention is the fraction of users who have not rated the mention, otherwise 0.05.
Experiment Results for Re-identification
The result showed that estimating the rating values improves the performance a lot. The more movies the user mentioned, the more easier the user is identified. 60% of users who mentioned at least 8 movies can be 1-identified with the scoring algorithm, and nearly 90% when knowing the exact rating values.
Alter the dataset
An algorithms is proposed and evaluated for how to alter the dataset to prevent re-identification.
Data Suppression: Drop the rarely rated items. The experiment results showed that we need to drop 80% of the dataset in order that none of the users can be 1-identified. This algorithm can be quite harmful for research purpose.
Two algorithms are proposed and evaluated for how to defense not to be identified from the users.
1. Suppression: try not to mention the movies that are rarely rated. The experiment result shows that even dropping 50% of the mentions, we cannot make sure that the user cannot be 1-identified.
2. Misdirection: mention items that they have not rated. The experiment result showed that by mentioning the popular movies, the 1-identification possibility reduces greatly. With mentioning 10 popular movies, the 1-identification possibility is less than 1%.