Systems that allow users to rate items are becoming increasingly important. The growing amount of available information demands better ways of filtering it. “Crowdsourcing” that filtering to a large group of users is a common approach, and rating systems facilitate it. One important aspect of rating systems is the number of options they offer. The following will compare these rating systems and argue that simple two- or even one-option systems should be the preferred choice.
Sites that let users filter large amounts of data are numerous. Most well-known sites use this in some form or another, ranging from improving search results via +1 buttons, to hiding comments that users rate badly, to displaying items for sale depending on user ratings.
The ways these rating systems work are as numerous as the sites and applications themselves. The main question I want to answer is how many options such a rating tool should have for best effect. Part of that is figuring out what “best” means here.
The context is a site where users can publish items they own or have bought and rate them. The site can then recommend new items to users based on their ratings and the ratings of others. A basic algorithm for this would calculate the distance between two users and pick items from close users that are not in the current user’s list.
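The basic algorithm above can be sketched in a few lines. Everything here is an illustrative choice, not an actual site’s implementation: the users, their ratings, and the Euclidean distance over shared items are all made up for the example.

```python
# A minimal sketch of the distance-based recommender described above.
# The user names, ratings, and choice of Euclidean distance are
# illustrative assumptions, not any specific site's implementation.
from math import sqrt

def distance(a, b):
    """Euclidean distance over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return float("inf")
    return sqrt(sum((a[i] - b[i]) ** 2 for i in shared))

def recommend(user, others):
    """Pick items from the closest other user that `user` has not rated."""
    closest = min(others, key=lambda o: distance(user, o))
    return [item for item in closest if item not in user]

alice = {"book_a": 5, "book_b": 4}
bob   = {"book_a": 5, "book_b": 4, "book_c": 5}
carol = {"book_a": 1, "book_b": 2, "book_d": 3}

print(recommend(alice, [bob, carol]))  # ['book_c'] — bob is closer than carol
```

A real system would of course aggregate over many close users rather than just the single closest one, but the sketch captures the shape of the computation.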
Typical Rating Systems
The following are typical rating systems in use on the net.
- Multi-Point Scales
- These scales allow rating items along a discrete set of points. An odd number of points allows a neutral response, while an even number forces the user to pick a side. Typical cases are 5-point systems as used by Amazon or GoodReads, 7-point systems as used by many surveys, and even 10-point systems as used by IMDb.
- Like/Dislike Scales
- Another common approach reduces the number of options to two: the user simply says whether they like the item or not. The user interface is much simpler than with multi-point scales. Such scales are used on sites like Stack Overflow, but also for Amazon reviews.
- Like Scales
- Simplified even further, Google+ and Facebook reduce the user’s input to a single option: either liking an item, or saying nothing about it.
The following sections will compare these general approaches from different angles.
Voluntary rating systems have a strong inherent selection bias. First of all, it takes a certain amount of emotional involvement to go to a site and rate an item. This makes middle-ground votes less likely and extreme ratings more likely.
Additionally, for items that are on sale, the raters usually had to buy them in the first place. Items are usually bought only after some investigation, making it much more likely that the people who rate them like them.
This goes so far that for Amazon, simply buying an item is already grounds for recommending it to people who bought similar items.
Another typical effect shows up in 4- and 5-star ratings, which in practice are largely interchangeable. Some users use 4 stars for “good item” and reserve 5 stars for “absolutely awesome.” Others use 5 stars for generally good items and 4 stars for items that are still good but have some specific drawbacks.
This is compounded by Amazon already considering a 3-star review critical. As a result, the more numerous positive reviews have to fit into two rating values where users would likely prefer three (“awesome,” “normally good,” and “good with some drawbacks”). The rarer negative reviews generally fall into only two categories (“bad, but has some redeeming factors” and “abysmally bad”), yet have three options to choose from.
For the purpose of calculating the distance between the tastes of two users, this choice of scale greatly reduces inter-rater reliability. It is entirely unclear whether any particular combination of 4- and 5-star ratings means that two people are closer to each other than any other such combination.
Hence, 5-point ratings seem suboptimal for this purpose. 6- or 7-point ratings would likely perform better, as they allow a cleaner separation of positive reviews. Alternatively, reducing the scale further to 2 or 1 options would also remove the confusion around positive ratings.
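The 4-vs-5-star convention problem can be made concrete with a toy example. The two users below (entirely hypothetical, as is the threshold of 4 stars for “like”) agree on every item, yet their different conventions put them at a nonzero distance; collapsing to like/dislike removes the artifact.

```python
# Toy illustration (invented users and threshold) of how differing
# 4-vs-5-star conventions inflate the distance between two users who
# in fact agree — and how collapsing to like/dislike removes it.
from math import sqrt

def distance(a, b):
    """Euclidean distance over the items both users have rated."""
    shared = set(a) & set(b)
    return sqrt(sum((a[i] - b[i]) ** 2 for i in shared))

# Both users like all three items; "strict" reserves 5 stars for
# "awesome", while "lenient" gives 5 stars to any generally good item.
strict  = {"a": 4, "b": 4, "c": 5}
lenient = {"a": 5, "b": 5, "c": 5}

print(distance(strict, lenient))  # sqrt(2) ≈ 1.41, despite full agreement

def collapse(ratings, threshold=4):
    """Map a 5-point rating to like (1) / dislike (0)."""
    return {i: int(r >= threshold) for i, r in ratings.items()}

print(distance(collapse(strict), collapse(lenient)))  # 0.0 — agreement restored
```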
Peculiarities of Taste
Measuring taste, which is what we are after, is an inherently fuzzy task. Even a perfect measuring system that found two users with identical current item selections would not guarantee that further items liked by one user would be liked equally by the other. The correlation merely gives an increased likelihood.
This matters especially considering how the results are meant to be displayed: there is no single correct answer. The correlation is meant to order the set of possible items, not to pick the single most appropriate one. If a more exact scale cannot reliably determine the order of any two items, but a less exact scale still puts interesting items near the top of the list, the less exact scale is not much worse for the purpose of recommendation.
This suggests that a large number of options only feigns accuracy for this purpose. Given the much simpler user interface, a 2-option or 1-option rating system seems very tempting here.
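The ordering argument can be sketched with invented data: if the items a 5-point ranking puts near the top are the same ones a collapsed like/dislike ranking puts there, little is lost for recommendation. The rating lists and the like-threshold of 4 stars below are assumptions for the example.

```python
# Sketch (invented data) of the ordering argument above: compare the
# top of a ranking by average 5-point rating with the top of a ranking
# by the same ratings collapsed to like (>= 4 stars) / dislike.
ratings = {
    "a": [5, 5, 4], "b": [4, 4, 5], "c": [3, 3, 4],
    "d": [2, 3, 2], "e": [1, 2, 1],
}

def avg(xs):
    return sum(xs) / len(xs)

# Rank by mean 5-point rating, and by fraction of "likes".
fine   = sorted(ratings, key=lambda i: -avg(ratings[i]))
binary = sorted(ratings, key=lambda i: -avg([int(r >= 4) for r in ratings[i]]))

print(fine[:3], binary[:3])  # ['a', 'b', 'c'] ['a', 'b', 'c']
```

Both scales agree on which items belong near the top of the list, which is all the recommendation use case needs.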
I would be tremendously curious about doing an actual study on this.
Personal Dislikes and Bullying
A recurring problem with negative options in a voting system is a bullying effect, where personal dislike of a specific person causes others to “negative-vote” items related to that person.
If a rating system is used within a social context, this may be a serious concern and leaving out the negative voting option altogether might be preferable.
Multi-point ratings are widespread, but their benefits are quite unclear. The selection bias inherent in public rating pushes ratings toward the positive side, making it difficult to distinguish between the positive ratings themselves. Additionally, the inherent fuzziness of taste means that, for the purpose of recommending items to other users, the differences between close rating values have little relevance.
A simple “like”/“dislike” or even just a “like” interface seems to have no significant disadvantages for the purpose of generating recommendations. It does have definite advantages regarding user interface simplicity, though.
Dolnicar et al. (2011) have collected a fair amount of research that supports this point. Their references present a good starting point for further work.