Sunday, February 13, 2011

Finding related items

I've mentioned in the past (not in this blog) that C++ programs can beat databases by many, many orders of magnitude. I learned this while doing a specific fun project. I have permission from the company to discuss that project, and here is a description of that project.

I was working for a contract for ThisNext. Their database has a large number of records of different items, which a large number of people have tagged with arbitrary tags. Think millions of items, millions of distinct tags, and hundreds of millions of times that tags were applied to items. (Often the same tag has been applied to the same item multiple times.) Their idea was that this data could be used to discover which other items are somehow "related" to the current one, and possibly of interest if you find the current one interesting. This is used to populate related items, which is intentionally somewhat whimsical. For instance if you are interested in a pair of silk hotpants then perhaps you might be interested in velvet shorts, lumberjack hotpants or a Bo Peep Jumpsuit.

The suggestions are intentionally somewhat whimsical, and are generated by user activity. The idea behind them is that two items with tags in comment are likely to be related. The more common the tag, the less related they are. For every item they wanted the 30 most related items, sorted in the order of how related they are.

There is an obvious computational problem that you will face. Suppose a popular tag was applied to 30,000 different items. Then for 30,000 items you're going to have to compare to 30,000 other items, which gives you 900,000,000 relationships. This clearly is a scalability problem. So their initial version had implemented this as they just counted the number of tags in common between items, and they skipped any tag that was used over 500 times.

This originally worked fairly well, but as time went by it slowly got worse. The first problem is that their program slowed down as they got more data. When I arrived it was taking a couple of hours to run and was affecting when they could start running backups. The second problem was that if you had an item that had only been tagged once with a popular tag, such as "sunglasses", then no suggestions would be generated for it. They wanted me to try to solve one or both, preferably both, problems.

My first step was to try to figure out the ideal result. And then I was going to try to optimize that. So my first step was to try to find a smooth function that worked over multiple orders of magnitude, that captured the idea that a common tag was less of a relationship than an uncommon tag. The phrase "multiple orders of magnitude" made me think of logarithms, and so I decided that if a tag had been used n times for one item, m times for the other, and s times all told, then log(n + m)/log(s) would be the strength of that relation. The total weight of the connection between two items, I decided, should be the sum of the weights from the shared tags. I decided to break ties in favor of the most tagged other items. (Thinking that this was evidence of popularity, and popular items were probably better.) And any remaining ties in favor of the most recently created item. (Being tagged a lot in a short time is also evidence of popularity.)

Note that this formula is reasonable, but arbitrary. It was chosen not out of any knowledge of what the "right" answer is, but rather as the simplest reasonable formula that looked like it reflected all of the ideas that they wanted to capture.

Next came finding a potentially efficient way to implement it. My idea was that I could do it with two passes. In the first pass I'd compare each item with up to the top 100 items withing a tag. Then I'd sort and find the top 100 related items. In the second pass I'd figure out the full relationship between those items, and sort in the final order. This would prevent the quadratic performance problem that kept popular tags from being analyzed in the first implementation.

If you are familiar with databases then you may wonder how I was planning to pick the top 100 items from each tag, for all tags at once. If I was using Oracle I could just use analytic queries. That feature made it into the SQL 2003 standard, and so support for windowed functions is showing up in other major databases. But I wasn't using a database with support for that. I was using MySQL. However MySQL has an efficient hack for this problem which is described in http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/. (If you're doing a lot of work with MySQL, that hack is worth knowing about.) With that hack it is easy to have a query that will return the top 100 items for each tag.

So I coded up my program as a series of MySQL queries. When it came to actually processing the items there was no way that there was space for the database to handle it in one big query, so I opted for having a temp table into which I stuck ranges of items, which I'd then process in batches. The program ran..but it took 5 days to finish. This obviously was not going to be good enough. I thought about it long and hard, and decided that the best way to optimize would be to try to rewrite it in something other than SQL.

Normally I'd use Perl, but Perl wastes memory fairly liberally, so I was sure that that it couldn't store this dataset in RAM. I am not a C++ programmer, but I noticed is that C++ has vectors. Vectors support the same basic operations that I am used to in Perl. But (unlike Perl arrays) they are memory efficient and very fast. So I talked to the company, got permission to try rewriting this in C++, then went to work.

According to my memory, I used 4 data structures.
  1. item: A vector of structs. Each struct said how often the item had been tagged, where the item_tags started, and the number of item_tags there were. (The item_id was your position in the vector.)
  2. item_tag: A vector of structs. Each struct knew the item, the tag, and how many times that tag had been applied to that item. It was sorted by item_id then tag_id.
  3. tag: A vector of structs. Each struct knew the number of times that tag had been tagged, where the tag_items started, and the number of tag_items there were.
  4. tag_item: A vector of structs which is exactly like item_tag but this one was sorted by tag and then item.

Again I wrote my program in 2 passes exactly like before.

For each item the first pass was meant to identify which other items might possibly be the most related. Here it is in pseudocode:

foreach tag for that item (found in item_tag, the range is from item_tag):
from tag figure out the range of tag_items I was interested in
(the size of the range was a constant, other_items_limit)
foreach other item found in tag_items:
if item and the other item differ:
save the existence of that relationship in relationships
(a relationship contains the other item id, the weight, and
the popularity of the other tag = how many times it was tagged)

sort relationships by other item id

aggregate relationships by other item id (aggregating weights across tags)

sort relationships by weight, times other item tagged, and other item id
(all descending)

pick top other_items_limit items for the second pass

The purpose of the second pass is to find the final list of other items. Here it is in pseudocode:

foreach other item found in the first pass:
weight := 0
traverse item_tag in parallel for both:
if a common tag is found:
add to the calculated weight
save the relationship

sort relationships by weight, times other item tagged, and other item id
(all descending)
for the top 30 relationships:
print this relationship to a file


Remember that the original SQL version took 5 days. The C++ program took under 20 minutes. I got rid of a details file that wasn't actually needed except for debugging, and I stored all of the logarithms that I needed in a vector so I could do an index lookup rather than an expensive floating point calculation. Total run time dropped to under 5 minutes. I left in various assertions that had been a development aid, and called it a job well done. Together with the steps to run queries, extract a dataset, run the program, and do the final upload, end to end I was generating the full result set in something like 15 minutes.

This astounded me. I had good reason to believe that C++ was going to be faster than the database. But 3 orders of magnitude faster? That I did not expect!

I believe that the biggest single difference is the work you need to do to access data.

What is the fastest way to get at an isolated piece of information in the database? You do an index lookup. Meaning that you do a binary search through a tree that then tells you what database row has your information and what page it is to be found on. (Don't forget that walking a binary tree means multiple comparisons, if/then checks, etc.) You then read that page of memory, parse through it to find the row you want, parse the information you want out of that row and return it. Note that the instructions for what exactly to do is likely represented in some sort of byte code that has to be interpreted on the fly.

In my C++ program the equivalent operation is, "Access vector by index." Is there really any surprise that this is several orders of magnitude faster?

There are other differences as well. In the database I was working hard to keep batches small enough that the temporary tables I was throwing around would fit in RAM. In C++ a good deal of the time the next piece of information I wanted would be in L1 or L2 cache inside of the CPU. (The fact that I was walking through data linearly helps here.) MySQL likes to represent numeric types in inefficient ways, while the C++ program uses faster native data types. MySQL had to calculate a lot of logarithms, which was work I mostly avoided in C++. These things all helped.

It is also possible that I had some fixable performance problem in my SQL solution. But in discussing this project I have encountered a number of other people who have had similar experiences. And the general order of magnitude of speedup that I encountered is apparently not particularly rare when going from SQL to C++.

However despite this experience, I remain a big proponent of pushing work to the database and not worrying about performance unless you need to. Why? Because I've pushed databases much harder than most people do. Only a handful of times have I hit their limits. And until I hit those limits, they do a lot to simplify my life, and speed development.

But I am now aware exactly how many orders of magnitude faster I can make things by rewriting in C++. But with a giant corresponding increase in development time and required testing before you know you have correct results.

A final note. This project was fun for me, but what did ThisNext think? They were delighted! I solved their batch processing performance problem. In testing the suggestions produced were clearly better. And based on traffic and revenue figures following the release of my change, I was told that I had increased revenue by at least 10%.

2 comments:

by321 said...

I saw the same thing when I was doing the Netflix Prize. It had only 100 million items but database was already too slow. One of the first thing I tried was importing the data into a database, but I found doing any sort of nontrivial computation just took too long with data in the database.

Your item relationship/correlation problem is very interesting, it's like a combination of what we called regular correlation and "binary correlation". I wonder if you tried log(n * m)/log(s) instead of log(n + m)/log(s) ? Imagine a tag has been applied once to item A and 99 times to item B, but 50 times each to item C and D, obviously C and D should be more closely related than A and B, but n+m would give you the same result. Of course this is only my hunch, the data may prove otherwise.

Ben Tilly said...

I did think about using log(n * m)/log(s), but that runs into the serious problem that if A and B have each been tagged with an item once, then the weight of that relationship is 0. Which is obviously not the desired behavior. It is easy to fix that issue, but you have an infinite number of possible variations.

What really should be done is that various guesses should be A/B tested against each other. However setting up that particular A/B test would be rather complex, and nobody ever did it.

Incidentally reading through my notes again, it seems that the original program was pushing 8 hours, not 2 hours.