Friday, December 11, 2009

Design of a reporting system

I've had to build a lot of reports, and in my last job I build a reporting system that I think worked quite well. This post explains some of the key ideas behind that system.



One of my best ideas was to use Excel, not compete with it. Users will come with detailed requests for the exact reports they want. You can try to provide what users ask for. Inevitably they will change their mind, or the report lovingly described by someone's manager won't be what that person wants, or a million variations on this theme. This will generate an endless stream of requests as people try to get you to tweak what is being provided into their exact desire. Alternately you can make the data from your system directly accessible in Excel, and encourage people create the exact reports they want themselves in an interface they are comfortable with. Now all that you need to do is provide raw data, appropriately aggregated, and you're no longer a bottleneck. Giving people what they want rather than what the initially ask for makes for happier users.

Luckily it is easy to integrate with Excel. What you need is a way for users to design a web-based report which provides the raw data they want, and make it available with a short URL (less than 50 characters). Now a user can open up a new Excel workbook, then choose Data, Import External Data, then New Web Query. This pops open a browser. The user puts in their short URL, and the browser loads it. The user can then click on the HTML table that they want, save the web query, and Excel will run it again then fill a spreadsheet with the data in that table. Now a complex set of spreadsheets can be built off of that data, and every time they refresh it will re-run the report and get fresh data.

That's the user experience. Behind the scenes the developer needs to do several things to make this work.
  • Provide users a way to save the parameters of a given report, and create a named report. The raw parameters will tend to be too long to be saved properly. When users access this named report you must not issue a redirect. If you do a redirect it will work initially, but Excel internally remembers the redirected location. Which means that if the user tweaks the saved report, the Excel spreadsheet won't notice the tweak.

  • In your HTML give an unambiguous ID attribute to the table with the data. Excel will use this ID to locate the data. Otherwise it will try to analyze the structure of the HTML, and that will break easily when the report changes.

  • Have a way to easily specify relative dates. People frequently want reports that run to the present, run for last month, and so on. If they only had fixed dropdowns when they set up the report then the Excel spreadsheets they get won't update to what they want it to be. I solved this problem by providing an optional text field that I passed to Perl's Date::Manip to parse. (I actually improved the date parsing slightly for my purposes, but that was the base.)



Moving on, my philosophy about reports is that 1 report that can be tweaked in 100 ways is more flexible and powerful than 10 reports that can be tweaked in 10 ways each. Similarly when given a one-off reporting request, it is better to find a way to add something to an existing report to make it capable of handling that request than to do the one-off request. That is because if someone wants that feature once, someone else is likely to want it again. And there is something really satisfying about receiving the second request and being able to say, "I've already built that, here let me show you..."

Of course the challenge is converting that philosophy into action. How exactly do you go about building a flexible report that can be tweaked in many ways?

My strategy was to look at each report as an array of SQL statements. All statements before the last would create temporary tables. The last query would return the data set that I would display. Thus if I needed to pull in extra information - for instance I needed to segment data by whether a given promotion was run - when building the array I could dynamically insert a query to fetch that particular piece of information, and then pass it through all of the tables.

This strategy naturally leads to two useful features. The first is that you can add the exact SQL statements used to generate the report in an HTML comment. This is very useful for auditing purposes. The second is that you can add the feature of having a dropdown box listing the temporary tables that will be created, and allow the report to run to that point then display those intermediate results. This is helpful when debugging. And not just for the reporting engineer - it was very useful for me when the accounting department was able to come to me and say, "This report, at this step, is missing these invoices."

This strategy, of course, requires being able to create definitions for temporary tables on the fly. Most databases provide that. Whether you are in MySQL, PostgreSQL, MS SQL, Sybase, etc you have exactly the right kind of temporary table available. The glaring exception is Oracle. There are a couple of workarounds available with Oracle. I've tried automatically building queries with large subqueries. It kind of works, however the result is unreadable by end users, you can't easily display intermediate results, and I miss the ability to control execution plans by creating temporary tables and adding appropriate indexes on the fly before moving on.

The other requirement of this strategy is that your SQL statements can be customized by the addition of an arbitrary list of extra fields, join conditions, etc. I wrote more reports than I care to admit before I realized that the best way to do this is to use a templating system to build your SQL. Now that I have used templates to dynamically generate SQL statements, I wouldn't want to go back. I used Perl's Template Toolkit for that purpose. Again I tweaked it substantially for my purposes. But many different templating systems could work well.



Moving on, let's look at how I organized things under the hood.

Obviously I needed complex forms with lots of checkboxes, input boxes, dropdowns, and the like. I chose to make each form control into an object that knew how to find its data out of the form parameters, do validation, write out its HTML, and pass information on itself to the templates. Some of the objects had SQL statements attached so that I didn't have to duplicate logic between reports.

There are endless arguments in software circles about separating presentation from content. In some circles mixing content and presentation as these objects did is anathema. In general web development I would not be inclined to move the HTML for form elements into what is basically a model object. However in this case it worked out very well. The form is so directly tied to the action of the report that it wouldn't make sense to have different people work on the two. These reports were for internal use, and so functionality mattered more than aesthetics. And as a practical matter this choice made the addition of options to forms extremely easy.

And, of course, I needed to have a way to coordinate everything. I did that with a report object. The report class had methods to create each kind of form object. It had a method to be displayed. When displayed it would show all of the form elements, check whether the report should be generated, if so it would run all of the queries, and display the report, and (of course), it would append necessary documentation.

With all of these pieces the code for a trivial report could look something like this:

use Report;

my $report = Report->new();

$report->add_textbox(
name => "user",
label => "Who is this for?",
);

$report->add_textbox_number(
name => "age",
label => "How old are you?",
);

my @queries;

push @queries, {
temp_table => "results",
sql => qq{SELECT :user as "User", [% age %] as "Age", current_date as "Today"},
args => {
user => $report->user,
},
};

$report->display(
title => "My Report",
queries => \@queries,
doc => qq{
<ul>
<li> <i>User:</i> The user this report was generated for.</li>
<li> <i>Age:</i> Their reported age.</li>
<li> <i>Today:&t;/i> The current date.</li>
</ul>
},
);



Now this report isn't very impressive. Also passing the age into the SQL as a template variable is poor style. (I did that just to show what the templating looked like.) But if you changed this to having a dozen form controls tweaking a data processing step with a half-dozen queries then you could have a very powerful and flexible report. Add in a few more reports, the ability to combine independent options, and the ability for users to build on top of this in Excel, and you wind up with a surprisingly powerful and flexible reporting system.

13 comments:

Aristotle said...

I use Dave Rolsky’s Fey to piece together queries, and wouldn’t want to give it up for anything.

Of course, the queries aren’t formatted any particular way, and the SELECT clause generally grows unwieldy – so the resulting queries couldn’t be shown to end users, at least not easily. But I would sooner be willing to spend effort to solve this issue in some way (SQL::Beautify?) than I would be to go back to treating SQL as strings.

jakestein said...

Ben,

I definitely agree with you on the importance of integrating with excel and specifying relative dates when building a reporting system. Did you think about trying out an externally developed reporting package rather than building internally? IMO, it's better to give the consumers of reports the ability to go through the entire process of creating, consuming, and modifying reports without incremental work from you / the tech team. Of course, there will be times where users get stuck, but then support is the vendor's problem, no yours. Plus, you're not responsible for bug fixes, new features, etc.

Disclosure: I'm more than a little biased. This is something that I have spent a lot of time on, and would be curious to hear your thoughts.

btilly said...

@jakestein: When I started in that job the first thing that I looked at was the viability of third party reporting solutions.

However it quickly became clear that one of the biggest problems that organization had was a sprawling collection of different databases with inconsistent (and generally not well thought out) schemas for information. For example I had to look in different databases for user information, data on promotions that had been run, tracking of order fulfillment, and financial data. That made it difficult to produce reports that combined all of those different types of data. As you moved from table to table, fields would frequently change names and data types. And of course the only place this was documented was in the brains of the people who developed and QAed it.

Users already had direct access to that data. But nobody had the domain knowledge to make heads or tails from it. In a few weeks of searching I didn't find any third party solutions that offered a way to reverse engineer any type of sanity on top of a sprawling data mess like that. I knew that I could grow a system that did.

As for the idea of exporting the whole problem to a third party, that strikes me as a bad business decision. Understanding of data formats and structure is institutional knowledge. A third party vendor has real disadvantages and no real advantages over the company in learning that knowledge. Some of that data is highly sensitive. And once a company is dependent on a third party for understanding its internal data, you've got a level of vendor lock-in that is going to cost you dearly down the read. (Read Information Rules for common business strategy around third party vendors and vendor lock-in.)

btilly said...

@Aristotle: I confess that I've never seen the point of using SQL generators for complex SQL.

ORMs are one thing - if someone is trying to develop something in a standard programming language then it makes sense to provide an abstraction layer to make the database details invisible within application code.

However complex queries are a different story. SQL is a reasonably standard and widely understood language that is more compact and expressive for what it does than any query builder that I have seen. What then is the point of learning another layer so that you can program in a less expressive way? Particularly when you wind up with arbitrary restrictions when you are trying to do complex things with the database. Fey claims to be better at that issue than alternatives. Maybe so. But I'd be surprised if there weren't things it misses that SQL has. Places I have seen issues in the past include
- Performing left joins
- Distinguishing between whether a condition goes in the WHERE or the ON clause of a left join (they are different)
- Handling joining to the same table multiple times.
- Handling self joins.
- Distinguishing between the following two:

  SELECT ...
  FROM foo
    LEFT JOIN bar
      ON foo.bar_id = bar.id
    LEFT JOIN baz
      ON bar.baz_id = baz.id
        AND baz.some_field <> 'cond'

and

  SELECT ...
  FROM foo
    LEFT JOIN (
      bar
      LEFT JOIN baz
        ON bar.baz_id = baz.id
    )
    ON foo.bar_id = bar.id
        AND baz.some_field <> 'cond'

(The parentheses are actually not needed in the last case, but stacking ON clauses without them gets confusing.)

If you're not having to work closely with SQL then these may seem like esoteric points. However when you're doing reporting in a serious way you encounter these and more fairly regularly.

Moving on, I actually wasn't using ginormous SQL queries. Instead I was breaking them up into a series of data steps using temp tables. The benefits of this approach are sufficiently big and non-obvious that I think I need another blog entry on it when I get time. But as for just one - it is easy to get well-formatted SQL for complex reports that is broken into digestible chunks which end users can follow.

Aristotle said...

- Performing left joins
- Distinguishing between whether a condition goes in the WHERE or the ON clause of a left join (they are different)
- Handling joining to the same table multiple times.
- Handling self joins.
- Distinguishing between the following two:


Fey passes all of these tests. There are still some things it has trouble with, but you have to get more creative than these. Plus, Dave’s goal is to map out pretty well everything that can be done with plain SQL; if Fey::SQL cannot yet express it, most likely it will be added (particularly if you proffer a patch).

Mind, this isn’t too surprising as it doesn’t try to simplify SQL writing for you in any way, anyway. It just gives you use OO syntax that corresponds 1:1 to SQL.

The point of that is it lets you write queries using symbols instead of strings, ie replaces table and column names with Fey::Column and Fey::Table along with corresponding ::Alias classes. It also lets you combine clauses in arbitrary order instead of the fixed order in which SQL demands them.

The upshot is that you can extract generic aspects from queries and write them in a reusable way.

F.ex. I used this to write a generic tagging (as in “folksonomies”) query aspect: I pass in a query object, a record table object, a tag table object, and a bunch of tags, and the function takes each tag, generates an anonymous tag table alias for it, joins that against the record table, and puts a WHERE condition on the tagname column for that alias. Now I don’t have to write query templates carefully laid out to make it possible to add tagging to them, nor go back and rearrange them if I didn’t write them that way to begin with. I can pass any query plus specific tables to this function, and it will weave the requisite bits into the query to restrict it by some tags.

It’s nice.

Building queries using templates as I did before was much more complicated and hairy.

btilly said...

@Aristotle: It does all of that? I'm surprised. Other ones that I have looked at don't. They have particularly had trouble with subtle details about conditions in ON clauses (including the need to rearrange ON clauses in some cases).

How does Fey do with temporary tables created (and sometimes dropped then recreated) on the fly? Looking at Fey it looks like it wants to know your schema, but there is no permanent schema there.

I ask because I have found that temporary tables have a number of advantages that I would not want to lose. I still mean to get around to a post on them in a few days, but just for a simple example, I have run across situations where the database didn't find a good execution plan for a multi-table join. With temporary tables you can force the execution plan by joining some things into a temporary table, adding indexes, then doing the main join.

This has sped up slow reports often enough that I wouldn't want to lose that tool from my toolbox.

That aside, even if Fey can do everything that SQL can, I still don't see the value add. You say that it gives me an OO syntax. But I don't care about OO syntax when looking at reports that basically are a series of SQL statements, display the result of the last one. You say that it lets you rearrange clauses. That's a pretty minor win from my point of view. I'd need something a lot bigger to add the dependency.

jakestein said...

@btilly: I think you're right that no reporting package can go into a random company's data and automatically make sense of whatever they've got. I do, however, think that if you focus on a specific industry that tends to store data in certain ways and tends to want a finite set of metrics, then there are very large efficiencies to be had through a third party focusing on nothing but analytics and reporting for these companies. I used to do this manually at my old job, and now build software that automates a large portion of the process.

Also, I disagree that this presents a big risk around vendor lock in. If you were to outsource your database analytics to a third party, you can fire them at any time, and you still have 100% of your data with zero effort, because your database is still in house. You can't say the same for the vast majority of outsourced, software as a service categories that are popular. I'm thinking of web traffic analytics, crm, project management, etc. With those tools, you're dependent on third party vendors for functionality and the underlying data. In many cases you can get at the underlying data through an API, but it's not as easy as if the data remained in house on your own db, and you have to work to get it into a format that is useful to you. Then again, you may prefer to use internally built software for some or all of those categories to avoid vendor lock in.

In any case, thanks for your response.

Yves Hiernaux said...

Hi Ben,

I am a firm believer in your model and the recent moves of Microsoft and Google Apps in the online sheets sharing is just another good step.

I would really be interested in knowing how you created a secure way to link excel sheets and web apps.

Was it an API ? Do the users have to enter login/pass each time in the sheet ?

How do you control authorization and access to data based on role, or something, ... ?

It would be really great to know a bit more.

btilly said...

@Yves: I had the luxury of not having to worry about authentication because the application was only on the internal network, and the execs were OK with not having any security on it.

But if I did need security, I would have tried adding authentication on the web page (by configuring HTTP digest authentication), and then seeing whether Excel handled it. I suspect it would have.

Once you have authentication, you can set up authorization in any way you want. That is a much trickier issue because the fundamental goal is to let the organization restrict access to data in the way that it wants. And there is no way to guess what kinds of rules they will want without knowing the organization.

Yves Hiernaux said...

As long as you are on MSFT platforms and use NTLM, all is fine.

You browse an intranet pages with few reports. You pick one, and it opens Excel.

You are automatically authenticated, the ASP/ASPX page receives the user, can apply a user profile and the backend returns authorised data.

Now if you are on the internet or intranet without MSFT servers, the only difference is when Excel opens, you get a popup to authenticate.

Users get quickly annoyed as this popup comes for each report you open.

Not to mention the fight you will get with SSO guys.

We used this technique for getting SAP reports for some big corporations, but dropped it as they were all moving to JAVA servers.

There are some hackery, for intranets, to get NTLM on JAVA, but IT validation teams didn't like it in general.

OpenOffice does not even have the luxury of NTLM, so it is always with a popup.

Now if there is a solution for that, I'm a big fan of web query for reports.

We considered it for our new project BeeBole but had to drop it for the same reasons.

btilly said...

That is annoying. It is just like Microsoft to try to use their control of one technology (Excel) to push others (ASP.net).

One possible end run is to try creating an IQY file and adding the name and password in POST parameters. See this article for information on how to set that up. That moves the authentication information into the spreadsheet. If you make the URL be https, then the password is now transmitted in encrypted form, and you are as secure as your Excel worksheet is.

Another thing that I'm wondering is whether you can solve the problem using cookies. Using session cookies might not work because it probably launches a new IE instance for each web request. But what happens if you set short lived persistent cookies (say 2 hours)? Those should be shared across different launches of IE, and may resolve your problem nicely. (You may need to do some extra work on the server side to use persistent cookies for authentication rather than session cookies.)

Yves Hiernaux said...

In general, and hopefully the server have passwords encrypted.

And it would be insecure, and rather inconvenient, to ask users to write their credentials in an IQY file.

Excel and IE are 2 separate worlds, I don't think you can share cookies between them.

But there are other browsers and OpenOffice too.

dorla said...
This comment has been removed by a blog administrator.