Collaborative Filtering Task

Collaborative filtering is the task of predicting the preferences a user assigns to items, based on that user's preference data and the preference data of other users. The user whose preferences are being predicted is termed the active user, while the rest of the users are the non-active users. The items on which the active user reports preferences are termed the reported items, and the remaining items are termed the predicted items. The user's preferences may be entered explicitly by the user, as in EachMovie or the Microsoft songs recommendation dataset. In this case, the preference is usually encoded by a numeric score such as 1, 2, 3, 4, 5. A different setting is that of implicit preference data. In this case, the users do not explicitly enter their preferences; instead, their actions are recorded and interpreted as preference assertions. For example, web-log data that records people's browsing may be interpreted as follows: visiting a web page means preference 1, and not visiting a web page means preference 0. These preferences may then be analyzed to suggest web sites that users might be interested in browsing. The interpretation of implicit data may be criticized on different grounds (for instance, not visiting a page may simply mean that the user never came across it), but it remains a potentially powerful application.
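The terminology above can be made concrete with a small sketch. The data layout (a dictionary keyed by user, then item), the toy ratings, and the function names `split_items` and `implicit_preferences` are all assumptions made for illustration; they do not come from EachMovie or any other dataset mentioned here.

```python
# Hypothetical toy data: explicit ratings on a 1-5 scale, keyed by user, then item.
ratings = {
    "alice": {"m1": 5, "m2": 3},
    "bob": {"m1": 4, "m3": 2},
    "carol": {"m2": 1, "m3": 5},
}
all_items = {"m1", "m2", "m3"}


def split_items(ratings, active_user, all_items):
    """Return (reported items, predicted items) for the active user."""
    reported = set(ratings[active_user])
    predicted = set(all_items) - reported
    return reported, predicted


def implicit_preferences(visits, all_pages):
    """Turn (user, page) web-log records into 0/1 preferences.

    A visited page gets preference 1, every other page gets preference 0.
    """
    visited = {}
    for user, page in visits:
        visited.setdefault(user, set()).add(page)
    return {
        user: {page: int(page in pages) for page in all_pages}
        for user, pages in visited.items()
    }
```

With "alice" as the active user, `split_items(ratings, "alice", all_items)` gives the reported items `{"m1", "m2"}` and the predicted item `{"m3"}`; the other two users play the role of non-active users.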

A collaborative filtering (CF) system usually reports its output in one of two forms. If users report their preferences using numeric values, the system may try to predict the numeric value associated with each predicted item. The prediction may be reported as a probability vector over the possible scores for each predicted item, or summarized by statistics such as the mean and variance of that distribution. Alternatively, the predicted items may be sorted into a ranked list and presented to the active user without the predicted numeric scores. In the first case, performance is often measured by computing a normed distance between the means of the predicted numeric scores and the actual preference values. Of course, if the true preferences of the active user are known only for a subset of the predicted items, the distance has to be computed on that subset alone. In the second case, the ranked list may be evaluated using the expected utility of a user selecting items from the list with probability decreasing exponentially in the rank of the item.
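The two evaluation styles can be sketched as follows. The text does not fix a particular norm or utility formula, so the choices here are assumptions: the normed distance is taken to be the mean absolute error, and the ranked-list utility uses a half-life form in which the item at rank `half_life` is assumed to have a 50% chance of being selected. The function names and the parameters `neutral` and `half_life` are hypothetical.

```python
def summarize(prob_vector, scores=(1, 2, 3, 4, 5)):
    """Mean and variance of a predicted score distribution."""
    mean = sum(p * s for p, s in zip(prob_vector, scores))
    variance = sum(p * (s - mean) ** 2 for p, s in zip(prob_vector, scores))
    return mean, variance


def mean_absolute_error(predicted_means, true_scores):
    """Average |prediction - truth| over items whose true score is known."""
    common = [item for item in predicted_means if item in true_scores]
    if not common:
        raise ValueError("no predicted item has a reported true preference")
    total = sum(abs(predicted_means[i] - true_scores[i]) for i in common)
    return total / len(common)


def ranked_list_utility(ranked_items, true_scores, neutral=3.0, half_life=5.0):
    """Utility of a ranked list with exponentially decaying rank weights.

    Assumed form: each item contributes max(score - neutral, 0), discounted
    by 2 ** ((rank - 1) / (half_life - 1)); half_life must be greater than 1.
    Items without a known true score are skipped.
    """
    utility = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item not in true_scores:
            continue
        gain = max(true_scores[item] - neutral, 0.0)
        utility += gain / 2 ** ((rank - 1) / (half_life - 1))
    return utility
```

For example, `mean_absolute_error({"a": 4.0, "b": 2.0, "c": 5.0}, {"a": 5, "b": 2})` is computed over items "a" and "b" only, since no true preference is available for "c".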

Traditionally, CF systems completely ignore any content information about the items. For example, in the case of movie recommendation, movie features such as length and language are ignored. The predictions for the active user are made only on the basis of that user's and other users' preferences. The system is ignorant as to whether the items being recommended are movies, web pages, songs or anything else. Recently, some attempts have been made at incorporating item features into CF systems. This is an interesting prospect that seems to improve recommendation performance. Collaborative filtering (with and without item features) is a popular research topic whose state of the art changes rapidly and that has significant potential for commercial applications.