Comments on Random Observations: Finding related items

Ben Tilly (https://www.blogger.com/profile/04335648152419715383), 2011-02-27T13:42:45.130-08:00

I did think about using log(n * m)/log(s), but that runs into the serious problem that if A and B have each been tagged with an item once, then the weight of that relationship is 0, which is obviously not the desired behavior. It is easy to fix that issue, but there are infinitely many possible variations.

What really should be done is to A/B test the various guesses against each other. However, setting up that particular A/B test would be rather complex, and nobody ever did it.

Incidentally, reading through my notes again, it seems that the original program was pushing 8 hours, not 2 hours.

by321 (https://www.blogger.com/profile/14537616692777428127), 2011-02-27T12:41:51.330-08:00

I saw the same thing when I was doing the Netflix Prize. It had only 100 million items, but the database was already too slow. One of the first things I tried was importing the data into a database, but I found that any sort of nontrivial computation just took too long with the data in the database.

Your item relationship/correlation problem is very interesting; it's like a combination of what we called regular correlation and "binary correlation". I wonder if you tried log(n * m)/log(s) instead of log(n + m)/log(s)? Imagine a tag has been applied once to item A and 99 times to item B, but 50 times each to items C and D. Obviously C and D should be more closely related than A and B, but n + m would give you the same result. Of course this is only my hunch; the data may prove otherwise.
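
To make the two points in this thread concrete, here is a minimal sketch comparing the two candidate weights. The function names are made up for illustration, and the comments never define s, so the sketch simply assumes s is some per-tag total greater than 1 and picks s = 1000 arbitrarily. It is not the original program's code.

```python
import math

# Hypothetical helpers; n and m are how often a tag was applied to each
# of the two items. s is NOT defined in the comments, so it is treated
# here as an arbitrary normalizer greater than 1 (an assumption).

def weight_sum(n, m, s):
    """Weight using log(n + m) / log(s)."""
    return math.log(n + m) / math.log(s)

def weight_product(n, m, s):
    """Weight using log(n * m) / log(s), the suggested variation."""
    return math.log(n * m) / math.log(s)

if __name__ == "__main__":
    s = 1000  # arbitrary value chosen only for this illustration

    # The 1 + 99 versus 50 + 50 example: the sum form cannot tell the
    # pairs apart, while the product form favors the balanced pair.
    print("sum     A,B:", weight_sum(1, 99, s), " C,D:", weight_sum(50, 50, s))
    print("product A,B:", weight_product(1, 99, s), " C,D:", weight_product(50, 50, s))

    # The objection to the product form: with one tag each, n * m is 1,
    # and log(1) is 0, so the relationship gets no weight at all.
    print("product 1,1:", weight_product(1, 1, s))
    print("sum     1,1:", weight_sum(1, 1, s))
```

The sum form gives both pairs the same weight because 1 + 99 and 50 + 50 are both 100. The product form separates them but drops to zero when each item has been tagged once. Which variation actually works better is exactly what the A/B test mentioned above would have had to decide.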