Sunday, February 13, 2011

Finding related items

I've mentioned in the past (not in this blog) that C++ programs can beat databases by many orders of magnitude. I learned this while doing a specific fun project, which I have the company's permission to discuss. Here is a description of it.

I was working on a contract for ThisNext. Their database has a large number of records for different items, which a large number of people have tagged with arbitrary tags. Think millions of items, millions of distinct tags, and hundreds of millions of instances of a tag being applied to an item. (Often the same tag has been applied to the same item multiple times.) Their idea was that this data could be used to discover which other items are somehow "related" to the current one, and possibly of interest to someone who finds the current one interesting. This data is used to populate each item's list of related items. For instance, if you are interested in a pair of silk hotpants, then perhaps you might be interested in velvet shorts, lumberjack hotpants or a Bo Peep Jumpsuit.

The suggestions are intentionally somewhat whimsical, and are generated from user activity. The idea behind them is that two items with tags in common are likely to be related; the more common the tag, the less related they are. For every item they wanted the 30 most related items, sorted by how related they are.

There is an obvious computational problem here. Suppose a popular tag has been applied to 30,000 different items. Then each of those 30,000 items has to be compared with 30,000 others, which gives you 900,000,000 relationships. This clearly does not scale. So their initial version simply counted the number of tags in common between items, and skipped any tag that had been used more than 500 times.

This originally worked fairly well, but as time went by it slowly got worse. The first problem was that the program slowed down as the data grew. When I arrived it was taking a couple of hours to run and was affecting when they could start running backups. The second problem was that an item tagged only once, with a popular tag such as "sunglasses", would get no suggestions at all. They wanted me to solve one or, preferably, both of these problems.

My first step was to figure out what the ideal result would be; only then would I try to optimize how to compute it. That meant finding a smooth function that worked over multiple orders of magnitude and captured the idea that a common tag is weaker evidence of a relationship than an uncommon one. The phrase "multiple orders of magnitude" made me think of logarithms, and so I decided that if a tag had been used n times for one item, m times for the other, and s times all told, then log(n + m)/log(s) would be the strength of that relation. The total weight of the connection between two items, I decided, should be the sum of the weights from the shared tags. I decided to break ties in favor of the most-tagged other items (thinking that this was evidence of popularity, and popular items were probably better), and any remaining ties in favor of the most recently created item (being tagged a lot in a short time is also evidence of popularity).

Note that this formula is reasonable, but arbitrary. It was chosen not out of any knowledge of what the "right" answer is, but rather as the simplest reasonable formula that looked like it reflected all of the ideas that they wanted to capture.
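
To make the formula concrete, here is a minimal C++ sketch of the scoring and tie-breaking just described. The names are mine, and treating a larger item id as "more recently created" is an assumption made for illustration, not something from ThisNext's actual code:

    #include <cmath>

    // Weight contributed by one shared tag that was applied n times to one item,
    // m times to the other item, and s times across the whole site (s >= 2 for
    // any tag that two items can share).
    double tag_weight(unsigned n, unsigned m, unsigned s) {
        return std::log(double(n + m)) / std::log(double(s));
    }

    // Ordering for the final list: heaviest total weight first, then the most
    // tagged other item, then the most recently created one (assuming that a
    // larger id means a newer item).
    struct Related {
        unsigned other_item_id;
        double   weight;        // sum of tag_weight over all shared tags
        unsigned times_tagged;  // how often the other item has been tagged
    };

    bool more_related(const Related &a, const Related &b) {
        if (a.weight != b.weight)             return a.weight > b.weight;
        if (a.times_tagged != b.times_tagged) return a.times_tagged > b.times_tagged;
        return a.other_item_id > b.other_item_id;
    }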

Next came finding a potentially efficient way to implement it. My idea was that I could do it in two passes. In the first pass I'd compare each item with up to the top 100 items within each tag, then sort and keep the top 100 related items. In the second pass I'd figure out the full relationship with just those items, and sort them into the final order. This would avoid the quadratic performance problem that kept popular tags from being analyzed in the first implementation.

If you are familiar with databases then you may wonder how I was planning to pick the top 100 items for each tag, for all tags at once. If I had been using Oracle I could just have used analytic queries. That feature made it into the SQL:2003 standard, and so support for window functions is showing up in other major databases. But I wasn't using a database with support for that; I was using MySQL. However MySQL has an efficient hack for this problem, which is described at http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/. (If you do a lot of work with MySQL, that hack is worth knowing about.) With that hack it is easy to write a query that returns the top 100 items for each tag.

So I coded up my solution as a series of MySQL queries. When it came to actually processing the items, there was no way the database could handle it in one big query, so I opted for a temp table into which I stuck ranges of items, which I then processed in batches. The program ran... but it took 5 days to finish. This obviously was not going to be good enough. I thought about it long and hard, and decided that the best way to optimize would be to rewrite it in something other than SQL.

Normally I'd use Perl, but Perl wastes memory fairly liberally, so I was sure that it couldn't store this dataset in RAM. I am not a C++ programmer, but I knew that C++ has vectors. Vectors support the same basic operations that I am used to with Perl arrays, but (unlike Perl arrays) they are memory efficient and very fast. So I talked to the company, got permission to try rewriting this in C++, and went to work.

According to my memory, I used four data structures (a rough C++ sketch follows the list).
  1. item: A vector of structs. Each struct recorded how often the item had been tagged, where its item_tag entries started, and how many of them there were. (The item_id was the position in the vector.)
  2. item_tag: A vector of structs. Each struct knew the item, the tag, and how many times that tag had been applied to that item. It was sorted by item_id and then tag_id.
  3. tag: A vector of structs. Each struct knew how many times the tag had been used, where its tag_item entries started, and how many of them there were. (The tag_id was the position in the vector.)
  4. tag_item: A vector of structs holding exactly the same data as item_tag, but sorted by tag_id and then item_id.
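
Reconstructed from memory rather than from the original source, those four structures might be declared roughly like this (all field names are my own invention):

    #include <cstdint>
    #include <vector>

    struct Item {                  // indexed by item_id
        uint32_t times_tagged;     // how often this item has been tagged in total
        uint32_t first_item_tag;   // where this item's rows start in item_tags
        uint32_t num_item_tags;    // how many distinct tags this item has
    };

    struct ItemTag {               // sorted by (item_id, tag_id)
        uint32_t item_id;
        uint32_t tag_id;
        uint32_t count;            // times this tag was applied to this item
    };

    struct Tag {                   // indexed by tag_id
        uint32_t times_used;       // how often this tag has been applied in total
        uint32_t first_tag_item;   // where this tag's rows start in tag_items
        uint32_t num_tag_items;    // how many distinct items carry this tag
    };

    struct TagItem {               // same data as ItemTag, sorted by (tag_id, item_id)
        uint32_t tag_id;
        uint32_t item_id;
        uint32_t count;
    };

    std::vector<Item>    items;
    std::vector<ItemTag> item_tags;
    std::vector<Tag>     tags;
    std::vector<TagItem> tag_items;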

As in the SQL version, I wrote the program in two passes.

For each item, the first pass was meant to identify which other items might possibly be the most related. Here it is in pseudocode, followed by a C++ sketch:

foreach tag for that item (found in item_tag; the range comes from item):
    from tag, figure out the range of tag_items I was interested in
    (the size of the range was a constant, other_items_limit)
    foreach other item found in tag_items:
        if item and the other item differ:
            save the existence of that relationship in relationships
            (a relationship contains the other item id, the weight, and
             the popularity of the other item = how many times it was tagged)

sort relationships by other item id

aggregate relationships by other item id (aggregating weights across tags)

sort relationships by weight, times other item tagged, and other item id
(all descending)

pick the top other_items_limit items for the second pass
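
Under the assumptions of the struct sketch above, and with a hypothetical other_items_limit of 100, the first pass for a single item might look roughly like this in C++:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    const uint32_t other_items_limit = 100;   // assumed value; "up to the top 100"

    struct Candidate {
        uint32_t other_item_id;
        double   weight;          // summed across shared tags
        uint32_t times_tagged;    // how often the other item has been tagged
    };

    // First pass for one item: collect rough candidates from each of its tags,
    // looking at no more than other_items_limit items per tag.
    std::vector<Candidate> first_pass(uint32_t item_id) {
        std::vector<Candidate> found;

        const Item &me = items[item_id];
        for (uint32_t i = me.first_item_tag; i < me.first_item_tag + me.num_item_tags; ++i) {
            const ItemTag &mine = item_tags[i];
            const Tag &tag = tags[mine.tag_id];

            uint32_t end = tag.first_tag_item
                         + std::min(tag.num_tag_items, other_items_limit);
            for (uint32_t j = tag.first_tag_item; j < end; ++j) {
                const TagItem &theirs = tag_items[j];
                if (theirs.item_id == item_id) continue;   // skip the item itself
                double w = std::log(double(mine.count + theirs.count))
                         / std::log(double(tag.times_used));
                found.push_back({theirs.item_id, w, items[theirs.item_id].times_tagged});
            }
        }

        // Aggregate the weights per other item.
        std::sort(found.begin(), found.end(),
                  [](const Candidate &a, const Candidate &b) {
                      return a.other_item_id < b.other_item_id;
                  });
        std::vector<Candidate> merged;
        for (const Candidate &c : found) {
            if (!merged.empty() && merged.back().other_item_id == c.other_item_id)
                merged.back().weight += c.weight;
            else
                merged.push_back(c);
        }

        // Keep only the strongest candidates for the second pass.
        std::sort(merged.begin(), merged.end(),
                  [](const Candidate &a, const Candidate &b) {
                      if (a.weight != b.weight)             return a.weight > b.weight;
                      if (a.times_tagged != b.times_tagged) return a.times_tagged > b.times_tagged;
                      return a.other_item_id > b.other_item_id;
                  });
        if (merged.size() > other_items_limit) merged.resize(other_items_limit);
        return merged;
    }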

The purpose of the second pass is to produce the final list of related items. Here it is in pseudocode, again followed by a C++ sketch:

foreach other item found in the first pass:
    weight := 0
    traverse item_tag in parallel for both items:
        if a common tag is found:
            add its contribution to weight
    save the relationship

sort relationships by weight, times other item tagged, and other item id
(all descending)

for the top 30 relationships:
    print this relationship to a file
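
The "traverse item_tag in parallel" step is just a merge-style walk over the two items' sorted tag lists. Continuing the same hypothetical sketch:

    #include <cmath>
    #include <cstdint>

    // Exact weight between two items: walk both items' ranges of item_tags (each
    // sorted by tag_id) in parallel and add a contribution for every shared tag.
    double exact_weight(uint32_t item_id, uint32_t other_id) {
        const Item &a = items[item_id];
        const Item &b = items[other_id];
        uint32_t i = a.first_item_tag, i_end = a.first_item_tag + a.num_item_tags;
        uint32_t j = b.first_item_tag, j_end = b.first_item_tag + b.num_item_tags;

        double weight = 0.0;
        while (i < i_end && j < j_end) {
            if (item_tags[i].tag_id < item_tags[j].tag_id) {
                ++i;
            } else if (item_tags[j].tag_id < item_tags[i].tag_id) {
                ++j;
            } else {                      // a common tag
                const Tag &tag = tags[item_tags[i].tag_id];
                weight += std::log(double(item_tags[i].count + item_tags[j].count))
                        / std::log(double(tag.times_used));
                ++i;
                ++j;
            }
        }
        return weight;
    }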


Remember that the original SQL version took 5 days. The C++ program took under 20 minutes. I got rid of a details file that wasn't actually needed except for debugging, and I stored all of the logarithms that I needed in a vector so I could do an index lookup rather than an expensive floating point calculation. Total run time dropped to under 5 minutes. I left in various assertions that had been a development aid, and called it a job well done. Together with the steps to run queries, extract the dataset, run the program, and do the final upload, I was generating the full result set end to end in something like 15 minutes.
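
The logarithm table was a straightforward trade of memory for speed. A sketch of the idea, with the table size passed in as an assumed upper bound on the counts:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Precompute log(1) .. log(max_count) once, so the inner loops can replace a
    // floating point std::log call with a vector index lookup.
    std::vector<double> log_table;

    void build_log_table(std::size_t max_count) {
        log_table.assign(max_count + 1, 0.0);   // log_table[0] is an unused placeholder
        for (std::size_t i = 1; i <= max_count; ++i)
            log_table[i] = std::log(double(i));
    }

    // log(n + m) / log(s) then becomes: log_table[n + m] / log_table[s]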

This astounded me. I had good reason to believe that C++ was going to be faster than the database. But 3 orders of magnitude faster? That I did not expect!

I believe that the biggest single difference is the work you need to do to access data.

What is the fastest way to get at an isolated piece of information in a database? You do an index lookup. That means searching a tree (typically a B-tree) that tells you which database row has your information and which page it is to be found on. (Don't forget that walking that tree means multiple comparisons, if/then checks, etc.) You then read that page of memory, parse through it to find the row you want, parse the information you want out of that row, and return it. Note that the instructions for exactly what to do are likely represented in some sort of byte code that has to be interpreted on the fly.

In my C++ program the equivalent operation is, "Access vector by index." Is there really any surprise that this is several orders of magnitude faster?

There are other differences as well. In the database I was working hard to keep batches small enough that the temporary tables I was throwing around would fit in RAM. In C++ a good deal of the time the next piece of information I wanted would be in L1 or L2 cache inside of the CPU. (The fact that I was walking through data linearly helps here.) MySQL likes to represent numeric types in inefficient ways, while the C++ program uses faster native data types. MySQL had to calculate a lot of logarithms, which was work I mostly avoided in C++. These things all helped.

It is also possible that I had some fixable performance problem in my SQL solution. But in discussing this project I have encountered a number of other people who have had similar experiences. And the general order of magnitude of speedup that I encountered is apparently not particularly rare when going from SQL to C++.

However, despite this experience, I remain a big proponent of pushing work to the database and not worrying about performance unless you need to. Why? Because I've pushed databases much harder than most people do, and only a handful of times have I hit their limits. Until I hit those limits, they do a lot to simplify my life and speed development.

But I am now aware of exactly how many orders of magnitude I can gain by rewriting in C++, and of the correspondingly giant increase in development time and in the testing required before you know you have correct results.

A final note: this project was fun for me, but what did ThisNext think? They were delighted! I solved their batch-processing performance problem. In testing, the suggestions produced were clearly better. And based on traffic and revenue figures following the release of my change, I was told that I had increased revenue by at least 10%.

Friday, December 11, 2009

Design of a reporting system

I've had to build a lot of reports, and at my last job I built a reporting system that I think worked quite well. This post explains some of the key ideas behind that system.



One of my best ideas was to use Excel, not compete with it. Users will come with detailed requests for the exact reports they want. You can try to provide what users ask for. Inevitably they will change their minds, or the report lovingly described by someone's manager won't be what that person actually wants, or a million variations on this theme. This generates an endless stream of requests as people try to get you to tweak what is being provided into their exact desire. Alternatively, you can make the data from your system directly accessible in Excel, and encourage people to create the exact reports they want themselves, in an interface they are comfortable with. Now all you need to do is provide raw data, appropriately aggregated, and you're no longer a bottleneck. Giving people what they want rather than what they initially ask for makes for happier users.

Luckily it is easy to integrate with Excel. What you need is a way for users to design a web-based report that provides the raw data they want, and to make it available at a short URL (less than 50 characters). Now a user can open up a new Excel workbook, then choose Data, Import External Data, then New Web Query. This pops open a browser. The user puts in their short URL, and the browser loads it. The user can then click on the HTML table that they want and save the web query, and Excel will run the query and fill a spreadsheet with the data in that table. Now a complex set of spreadsheets can be built off of that data, and every time they refresh, Excel will re-run the report and fetch fresh data.

That's the user experience. Behind the scenes the developer needs to do several things to make this work.
  • Provide users a way to save the parameters of a given report, and so create a named report. The raw parameters will tend to be too long to work as a saved URL (remember the 50-character limit above). When users access this named report you must not issue a redirect. A redirect will work initially, but Excel internally remembers the redirected location, which means that if the user later tweaks the saved report, the Excel spreadsheet won't notice the change.

  • In your HTML give an unambiguous ID attribute to the table with the data. Excel will use this ID to locate the data. Otherwise it will try to analyze the structure of the HTML, and that will break easily when the report changes.

  • Have a way to easily specify relative dates. People frequently want reports that run to the present, run for last month, and so on. If all they had was fixed dropdowns when they set up the report, then the Excel spreadsheets they build will never update to show the period they currently want. I solved this problem by providing an optional text field that I passed to Perl's Date::Manip to parse. (I actually improved the date parsing slightly for my purposes, but that was the base.)



Moving on, my philosophy about reports is that 1 report that can be tweaked in 100 ways is more flexible and powerful than 10 reports that can be tweaked in 10 ways each. Similarly when given a one-off reporting request, it is better to find a way to add something to an existing report to make it capable of handling that request than to do the one-off request. That is because if someone wants that feature once, someone else is likely to want it again. And there is something really satisfying about receiving the second request and being able to say, "I've already built that, here let me show you..."

Of course the challenge is converting that philosophy into action. How exactly do you go about building a flexible report that can be tweaked in many ways?

My strategy was to treat each report as an array of SQL statements. Every statement before the last would create a temporary table; the last query would return the data set that I would display. Thus if I needed to pull in extra information - for instance, to segment data by whether a given promotion was running - I could, while building the array, dynamically insert a query to fetch that particular piece of information and then carry it through all of the subsequent tables.

This strategy naturally leads to two useful features. The first is that you can include the exact SQL statements used to generate the report in an HTML comment, which is very useful for auditing purposes. The second is that you can add a dropdown box listing the temporary tables that will be created, and allow the report to run up to that point and display the intermediate results. This is helpful when debugging - and not just for the reporting engineer. It was very useful for me when the accounting department was able to come to me and say, "This report, at this step, is missing these invoices."

This strategy, of course, requires being able to create temporary tables on the fly. Most databases provide that. Whether you are in MySQL, PostgreSQL, MS SQL, Sybase, etc., you have exactly the right kind of temporary table available. The glaring exception is Oracle. There are a couple of workarounds available with Oracle. I've tried automatically building queries with large subqueries. It kind of works; however, the result is unreadable by end users, you can't easily display intermediate results, and I missed the ability to control execution plans by creating temporary tables and adding appropriate indexes on the fly before moving on.

The other requirement of this strategy is that your SQL statements can be customized by the addition of an arbitrary list of extra fields, join conditions, etc. I wrote more reports than I care to admit before I realized that the best way to do this is to use a templating system to build your SQL. Now that I have used templates to dynamically generate SQL statements, I wouldn't want to go back. I used Perl's Template Toolkit for that purpose. Again I tweaked it substantially for my purposes. But many different templating systems could work well.



Moving on, let's look at how I organized things under the hood.

Obviously I needed complex forms with lots of checkboxes, input boxes, dropdowns, and the like. I chose to make each form control an object that knew how to extract its data from the form parameters, validate it, write out its HTML, and pass information about itself to the templates. Some of the objects had SQL statements attached so that I didn't have to duplicate logic between reports.

There are endless arguments in software circles about separating presentation from content. In some circles mixing content and presentation as these objects did is anathema. In general web development I would not be inclined to move the HTML for form elements into what is basically a model object. However in this case it worked out very well. The form is so directly tied to the action of the report that it wouldn't make sense to have different people work on the two. These reports were for internal use, and so functionality mattered more than aesthetics. And as a practical matter this choice made the addition of options to forms extremely easy.

And, of course, I needed a way to coordinate everything. I did that with a report object. The report class had methods to create each kind of form object, and a method to display the report. When displayed, it would show all of the form elements, check whether the report should be generated, and if so run all of the queries, display the results, and (of course) append the necessary documentation.

With all of these pieces the code for a trivial report could look something like this:

use Report;

my $report = Report->new();

$report->add_textbox(
    name  => "user",
    label => "Who is this for?",
);

$report->add_textbox_number(
    name  => "age",
    label => "How old are you?",
);

my @queries;

push @queries, {
    temp_table => "results",
    sql  => qq{SELECT :user as "User", [% age %] as "Age", current_date as "Today"},
    args => {
        user => $report->user,
    },
};

$report->display(
    title   => "My Report",
    queries => \@queries,
    doc     => qq{
        <ul>
        <li> <i>User:</i> The user this report was generated for.</li>
        <li> <i>Age:</i> Their reported age.</li>
        <li> <i>Today:</i> The current date.</li>
        </ul>
    },
);



Now this report isn't very impressive. Also, passing the age into the SQL as a template variable is poor style. (I did that just to show what the templating looks like.) But change this to a dozen form controls tweaking a data processing step built from a half-dozen queries, and you have a very powerful and flexible report. Add in a few more reports, the ability to combine independent options, and the ability for users to build on top of it all in Excel, and you wind up with a surprisingly powerful and flexible reporting system.