Joost de Wit graduates on evaluating recommender systems

Recommender systems use knowledge about a user’s preferences (and those of others) to recommend them items that they are likely to enjoy. Recommender system evaluation has proven to be challenging since a recommender system’s performance depends on, and is influenced by many factors. The data set on which a recommender system operates for example has great influence on its performance. Furthermore, the goal for which a system is evaluated may differ and therefore require different evaluation approaches. Another issue is that the quality of a system recorded by the evaluation is only a snapshot in time since it may change gradually. Although there exists no consensus among researchers on what recommender system’s attributes to evaluate, accuracy is by far the most popular dimension to measure. However, some researchers believe that user satisfaction is the most important quality attribute of a recommender and that greater user satisfaction is not achieved by an ever increasing accuracy. Other dimensions for recommender system evaluation that are described in literature are coverage, confidence, diversity, learning rate, novelty and serendipity. It is believed that these dimensions contribute in some way to the user satisfaction achieved by a recommender system.

Joost performed a user study for which 133 people subscribed to an evaluation application specially designed and build for this purpose. The user study consisted of two phases. During the first phase users had to rate TV programmes they were familiar with or that they recently watched. This phase resulted in 36.353 programme ratings for 7.844 TV programmes. Based on this data, the recommender system that was part of the evaluation application could start generating recommendations. In phase two of the study the application displayed recommendations for tonight’s TV programmes to its users. These recommendation lists were deliberately varied with respect to the accuracy, diversity, novelty and serendipity dimensions. Another dimension that was altered was programme overlap. Users were asked to provide feedback on how satisfied they were with the list. Over a period of four weeks 70 users provided 9762 ratings for the recommendation lists. For each of the recommendation lists that were rated in the second phase of the user study, the five dimensions (accuracy, diversity, novelty and serendipity) were measured using 15 different metrics. For each of these metrics its correlation with user satisfaction was determined using Spearman’s rank correlation. These correlation coefficients indicate whether there exists a relation between that metric and user satisfaction and how strong this relation is. It appeared that accuracy is indeed the most important dimension in relation to user satisfaction. Other metrics that had a strong correlation were user’s diversity, series level diversity, user’s serendipity and effective overlap ratio. This indicates that diversity, serendipity and programme overlap are important dimensions as well, although to lesser extent.

[more info] [download pdf]