If you talk to the people who design systems to produce recommendations tailored to user preferences, you'll see lots of impressively daunting mathematics, with formulae for measuring things like "Mean Absolute Error".
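For the record, Mean Absolute Error is the standard textbook measure here rather than anything vendor-specific: it is simply the average gap between the ratings a system predicts and the ratings users actually give,

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{r}_i - r_i \rvert$$

where $\hat{r}_i$ is the predicted rating for item $i$, $r_i$ is the rating the user actually gave, and $N$ is the number of rated items. Notice that this measures how well a system predicts ratings, which, as we're about to see, is not at all the same thing as whether its recommendations are any good.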
If you talk to the people who sell these systems, you'll hear stories of uncannily perceptive and prescient suggestions that anticipated interests that Jo Consumer didn't even know she had yet.
The information that neither group tends to volunteer until asked is that there are no agreed criteria for a perfect, or even a good, recommendation.
Hence you will see reviews of different recommender systems where people put the same starting point into each system and then compare the results to judge which is best. But judgements like this one are based on the subjective views of experts: they have no way of articulating what lies behind those judgements in a form that could be replicated in other studies. And how reliable are those judgements? Well, when Paul Lamere asked the readers of his blog to compare two sets of recommendations and identify which was compiled by an expert reviewer and which by machine, over three quarters of the respondents got it wrong. A coin toss, right half the time on average, would have done better than asking these people. (I was one of the few who got it right, but I had already said that there was no way to determine which was which with any confidence, so my answer was a guess, and it turned out to be a lucky one.)
So what is a good recommendation, and can we measure value or 'accuracy' at all?
Crucially, the answers depend on context. Who are you recommending to? What level of expertise do they already have in the area, and how keen are they to 'experiment' with novelty? What are you expecting them to do with your recommendations: sample them or buy them? What is the domain of the recommendations, and what parameters does it come with? John Riedl tells a story of a recommender system that said "people who bought leather gloves like yours also bought… a Mercedes". This may be statistically accurate, but something tells us it's unlikely to be appropriate to the context.
When I did an apprenticeship in psychometrics a couple of decades ago, I learnt about the importance of predictive validity for psychometric tests. If there is evidence that "people who perform well on this aptitude/personality test go on to get good on-the-job performance ratings in roles like the ones you're recruiting for", then Human Resources departments will be more likely to use those tests. It's all about correlations, and it's not very exact — not least because we know that on-the-job ratings are frequently less than perfect — but it gives you an indication that this stuff works.
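To make 'predictive validity' concrete, here is a toy sketch of what such a check amounts to: a plain correlation between test scores and later performance ratings. The numbers are made up, and it assumes Python 3.10 or later for statistics.correlation.

```python
# Toy predictive-validity check (illustrative only; the data are
# invented, not drawn from any real HR department).
from statistics import correlation  # available from Python 3.10

# Aptitude test scores at hiring time, and on-the-job performance
# ratings for the same seven people a year later.
test_scores = [62, 74, 81, 55, 90, 68, 77]
job_ratings = [3.1, 3.8, 4.2, 2.9, 4.5, 3.3, 4.0]

# The validity coefficient is just the Pearson correlation between
# the two series: the closer to 1.0, the better the test predicts
# later performance.
validity = correlation(test_scores, job_ratings)
print(f"predictive validity: {validity:.2f}")
```

Nothing more mysterious than that: one number summarising how well the test scores line up with what happened later.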
How could you apply predictive validity to recommendations? Well, Amazon must surely record which of its recommendations lead to purchases. I imagine they tweak their recommendation algorithms all the time to see how this affects the proportion of suggestions that generate sales. However, purchasing is only a crude binary measure: either you buy something or you don't. On a music subscription service you have many finer-grained measures. For each track or artist that you recommend, you can see how many times it is played and over what period (does it get played ten times and then forgotten for ever, or does it gradually become a long-term favourite?). Film services might be able to get some similar data, but they are constrained by the fact that most people watch most movies just once. And with books, as long as I stick to the analogue physical format, I can read a book as many times and as often as I like without it showing up on the digital radar. As for gloves and Mercedes…
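To sketch the difference between the two kinds of signal, here is a hypothetical contrast between a binary purchase measure and play-count data; everything in it is invented for illustration rather than taken from any real service.

```python
# Hypothetical contrast between binary purchase data and fine-grained
# play data; all figures and track names are invented.
from datetime import date

# Amazon-style binary signal: did a recommendation convert to a sale?
recommendations_shown = 1000
recommendations_bought = 27
print(f"conversion rate: {recommendations_bought / recommendations_shown:.1%}")

# Music-service-style signal: a play log per recommended track lets you
# tell a played-ten-times-then-forgotten track from a slow-burning
# long-term favourite, which a single purchase flag never could.
play_log = {
    "track_a": [date(2008, 1, d) for d in range(1, 11)],  # ten plays in ten days, then silence
    "track_b": [date(2008, m, 15) for m in range(1, 7)],  # one play a month, steadily
}

for track, plays in play_log.items():
    span = (max(plays) - min(plays)).days
    print(f"{track}: {len(plays)} plays spread over {span} days")
```

Track A actually has more plays, but it is the spread over time that tells you track B is on its way to becoming a favourite.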
I hope there may be some research on these kinds of measures at the recently announced Recommender Systems 2008 conference, because I missed this kind of thing at the last one.
In the absence of good measures of the validity of recommendations, I get the impression that most commercial recommender systems take a fairly pragmatic approach. Their salespeople may swear blind that theirs is the best, most telepathic, most wisdom-of-the-crowds-and-the-experts-and-the-runes, but what really counts is that it's just good enough to keep people coming back for more recommendations.
In this sense, recommender systems are principally a device to make a site or service more sticky, providing the opportunity to generate more revenue from sales, affiliate links and advertising.
And finally, that image at the top is a genuine screenshot from Amazon (not the result of creative image editing). Since I saw it, I've bought that Harmonia album, and I'm happy to endorse its recommendation.