Detecting exceptions and fraud on recommender systems
I stumbled across this iLike profile when I noticed someone who'd listened to David Bowie over 2 billion times. "Patrick S" has, according to his profile, been alive less than a billion seconds (and iTunes has been available for only the latter part of his life), yet he's managed to register over 2 billion plays for each of 32 separate artists. The data concerned seems to have been imported by iLike from iTunes.
Who knows whether this is just idle mischief, a software glitch, or a concerted effort at gaming or 'shilling' the iLike recommender system? Whatever the cause, recommender systems clearly have to detect such incongruent profiles — which can't be hard in cases as gross as this — and make sure the data is excluded from their recommendation algorithms. Cleaning up the profiles for this user and the 32 artists concerned would be a good idea, too, lest they undermine others' confidence in what they see on the site.
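To give a feel for how gross the mismatch is, here is a minimal sketch of a plausibility check. It is not anything iLike is known to do. The function names, the 30-second minimum track length and the 31.7-year listening window (roughly a billion seconds) are all assumptions chosen for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # about 31.6 million seconds


def max_plausible_plays(listening_years, min_track_seconds=30.0):
    """Upper bound on plays for someone listening non-stop for the whole period.

    min_track_seconds is an assumed shortest track length, deliberately generous.
    """
    return int(listening_years * SECONDS_PER_YEAR / min_track_seconds)


def is_implausible(play_count, listening_years, min_track_seconds=30.0):
    """True when a play count exceeds what non-stop listening could produce."""
    return play_count > max_plausible_plays(listening_years, min_track_seconds)


def flag_profile(artist_plays, listening_years):
    """Return only the artists whose play counts fail the plausibility check."""
    return {
        artist: plays
        for artist, plays in artist_plays.items()
        if is_implausible(plays, listening_years)
    }
```

Even with a bound this generous (tens of millions of plays over a billion seconds), a count above 2 billion for a single artist fails by a wide margin. A real system would presumably use a much tighter, data-driven threshold.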
Note also that the listen count of 2,147,483,647 happens to be 2^31 - 1. This number is significant because it is the largest value that a signed 32-bit integer can hold.
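The arithmetic is easy to confirm. The sketch below is only an illustration: the helper names are made up, and nothing here is known about how iLike or iTunes actually stores play counts.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647


def fits_in_int32(n):
    """True if n can be stored in a signed 32-bit (4-byte) integer."""
    try:
        n.to_bytes(4, byteorder="little", signed=True)
        return True
    except OverflowError:
        return False


def is_saturated(play_count):
    """True if a play count sits at (or beyond) the signed 32-bit ceiling."""
    return play_count >= INT32_MAX
```

A count pinned at exactly this ceiling is a strong hint that someone, or something, pushed the field as high as it would go.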
I suspect that this is idle mischief. You can follow the trail as patrick s experiments trying to find the largest playcounts:
Joey Schlabs 2,147,483,646
Paloalto 2,113,430,281
Andre 3000 2,112,545,318
Seven Band 2,112,303,498
Nick Cave & The Bad Seeds 2,108,076,443
Elvis Presley 2,104,761,844
Perhaps more troubling is that iLike doesn't catch this and filter out such an obvious hack. Now if you look at the fans of John Mellencamp (http://www.ilike.com/artist/John+Mellencamp/top_listeners), all you see is patrick_s; the rest of the fans are normalized to zero. That's a bad user experience.
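Here is a rough sketch of why a single outlier flattens everyone else. It assumes the listener scores are scaled against the top listener, which is a guess about iLike rather than something confirmed; the other user names, their play counts and the cap of 5,000 are hypothetical.

```python
def normalize_by_max(plays):
    """Scale each listener's plays to 0-100 relative to the top listener."""
    top = max(plays.values())
    return {user: round(100 * n / top) for user, n in plays.items()}


def normalize_with_cap(plays, cap):
    """Clamp each play count to an assumed sane maximum before scaling."""
    capped = {user: min(n, cap) for user, n in plays.items()}
    return normalize_by_max(capped)


# patrick_s is from the post; fan_a and fan_b are made-up listeners.
plays = {"patrick_s": 2_147_483_647, "fan_a": 1_200, "fan_b": 300}

naive = normalize_by_max(plays)           # everyone but patrick_s rounds to 0
capped = normalize_with_cap(plays, 5_000) # the other fans become visible again
```

Capping is only one option. Excluding the suspect profile altogether, as last.fm appears to do, avoids rewarding the hack with the top spot.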
(Note that I tried the same thing a while back to see what last.fm would do when I told it I had played Deerhoof a modest 200,000 times. They filtered it out.)
Posted by: Paul | 11 December 2007 at 01:44 PM
Moreover, if iLike gets the data from the iTunes personal library, one only has to open the associated XML file, and write down the desired number of plays per song.
That's it!
Posted by: oscar | 11 December 2007 at 04:34 PM
Thanks for this input.
Ah, Paul, some people recognise 2^31 - 1 immediately, and some don't ;-)
Oscar, I just tried editing the play count from the XML file as you describe, but iTunes seemed to outwit me when I started it up again and set it back to what it had been before. I suspect it's simple, but not quite that simple (for me!).
Posted by: | 11 December 2007 at 05:06 PM
It's a good illustration of how automated recommendation systems lack the most basic common sense. Machines are great at spotting patterns and apparent associations, but their recommendations are at best automated guesses (albeit sometimes very good guesses). When a friend recommends something, though, I know it's going to be good - there's a bit more substance to the recommendation than a mathematical algorithm.
Posted by: Sean McManus | 11 December 2007 at 06:26 PM
Funny, I noticed a similar glitch on a blog traffic site about 2 years ago. Not an automated recommendation system, but something used similarly.
The Truth Laid Bare used to be a worthwhile way to compare blog traffic -- until a few people figured out how to get multiple blogs. But what struck me about your post was how similar the two data runs looked:
Here's the faked misleading traffic data:
28) Athletics Nation :: An Oakland A's Blog 40455 visits/day
29) Red Reporter :: A Cincinnati Reds Blog 40455 visits/day
30) Bruins Nation :: A UCLA Bruins weblog 40455 visits/day
31) Camden Chat :: A Baltimore Orioles Blog 40455 visits/day
On and on for a dozen sites -- it struck me similarly to your list above.
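Spotting this kind of pattern automatically is not hard. The following is a minimal sketch, not how The Truth Laid Bare worked: the function name and the group-size threshold are assumptions, and "Some Other Blog" is a made-up entry added for contrast.

```python
from collections import defaultdict


def shared_values(traffic, min_group=3):
    """Group sites by reported visits/day and keep values shared by many sites.

    Several unrelated sites reporting the exact same figure is suspicious.
    """
    groups = defaultdict(list)
    for site, visits in traffic.items():
        groups[visits].append(site)
    return {v: sites for v, sites in groups.items() if len(sites) >= min_group}


traffic = {
    "Athletics Nation": 40455,
    "Red Reporter": 40455,
    "Bruins Nation": 40455,
    "Camden Chat": 40455,
    "Some Other Blog": 1873,  # hypothetical entry for contrast
}

suspicious = shared_values(traffic)
```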
The full fisking is here: http://bigpicture.typepad.com/comments/2006/02/gaming_the_blog.html
Posted by: Barry Ritholtz | 24 January 2008 at 11:55 AM