Monday, August 8, 2011

Financial stress on the system

I posted this last week on Google+ at https://plus.google.com/u/0/114613808538621741268/posts/13tBzre1nRa. Given market drops since then, I'm reposting it in a more public forum.

How could our next financial crisis start? I was in an interesting discussion about this last night. I'd give better than even odds that it will be ugly before Christmas. Here are some of the possibilities to watch.

- General economic slowdown. The economy seems to be slowing, and all of the other stresses I'll discuss are likely to slow it further. But it seems implausible that a general slowdown would happen suddenly enough to precipitate a full-blown crisis on its own. It does, however, provide background stress that could cause something else to break.

- European sovereign debt issues. We have all heard about Greece. Italy is in the news. But the really scary quantities of debt are in Spain. And Europe has shown no political appetite for meaningful bailouts of one country by another. Either the prospect of the EMU (European Monetary Union) dissolving, or a liquidity crisis caused by a large default (even if it doesn't get called a default), is a likely bet.

- US sovereign debt issues. We survived the debt ceiling debate. Two of the three rating agencies have left the USA at AAA, but on a negative watch. (Meaning better than even odds of a downgrade within a year.) The third, S&P, is likely, based on its past guidance, to downgrade the USA this month, though it could do what the other two did. The next test is what the super committee comes up with for the second half of the cuts in the debt ceiling deal; that's expected around Thanksgiving. Whether that triggers a crisis depends on what is cut, but a crisis there is unlikely. The bigger problem is that the politics have backed the US government into a corner where it will be unable to intervene if something else goes south.

- Residential mortgages. The current expectation in the industry is a 3-6% drop in home prices next year, with 15% as a worst-case scenario. However if the mortgage tax deduction were to be yanked (for instance as part of deficit reduction), much worse drops are possible. (And, of course, another financial crisis would take all bets off the table.)

- Commercial mortgages. Everyone is aware that commercial mortgages are in a world of pain. However servicers have been very forgiving. They really don't want to force bankruptcies, which are in nobody's best interest, so they restructure deals, collect as much blood as they can, then limp along. (A clear case of, "If you owe $100,000, you have a problem. If you owe $100,000,000, the bank has a problem.") However the sector is still under stress, and last month the delinquency rate spiked to the highest level on record. There could still be trouble there, particularly if a slowing economy reduces how much money is available to pay back debt.

- Private equity. There is a world of pain coming here as deals struck in 2006 and 2007 (2007 had a lot more) see principal payments come due this year and next. Based on the experience with commercial mortgages, I would expect the banks to be unusually forgiving. But the question is whether they have sufficient liquidity to be forgiving enough. (As much as the banks may want to make deals, they have a real problem if nobody pays them back.) Borders just went bankrupt; how many more companies will follow in the next year?

- Municipal debt. Meredith Whitney has been crying wolf on this for some time, but she has a real point. There are trillions of dollars of debt issued by municipal governments which are under stress. In the first half of this year they got unexpected revenue increases. However a slowing economy could put them back in trouble. And their budget cuts to deal with financial pain could slow the economy. To rub salt in the wound, most have large pension plans whose investments are budgeted to get 8% annual growth. Those rates are hard to reach in today's economy, so a lot of that money is in risky asset classes like investments in private equity. It is very likely that those assets will underperform, adding to budget holes. Central Falls, Rhode Island just filed for bankruptcy. How many more will follow in the next year?

- Chinese asset bubble pops. I know little about what is going on there. They do have a big asset bubble. History says that it will pop eventually. Nobody knows when, or what the fallout will be. However my impression is that when it happens, the pain will be mostly felt in China.

- External shocks. You name it. A successful terrorist attack. (The 10th anniversary of 9/11 is coming up...) Unrest in the Middle East triggering an oil spike. An unexpected major earthquake. (For instance the Cascadia fault slips, leaving over 100K dead in Seattle, Vancouver and Victoria.) These are all unlikely, but possible.

Tuesday, July 26, 2011

A worst case for the possible US government default

(This is crossposted from Google+. If you use Google+, you can find the original here.)

Lots of people are scared about the possibility of a US government default. But I have an unusual reason that few share. And many might think me crazy for worrying about one of the things that I worry about. So I'd like to share why I think what I do before I confess to my worry.

Close to a decade ago I read a fascinating book, Wealth and Democracy by Kevin Phillips. It is a large book. It is a detailed book. Most of it is a detailed history of when and where great wealth was created in US history, and its effects on the political process. (In a nutshell the conclusion is that periods of great wealth are periods where the wealthy wield great political influence, and conversely periods where the wealthy had less political control saw great declines in wealth disparities.) Don't try reading it unless you can get into the idea of reading close to 500 pages from a policy wonk going on about things that happened a hundred years ago.

Of course if you are able to read it, there is a lot of interesting material there. Kevin Phillips put a lot of effort in, and knows his stuff. He also has a history of making absurd predictions that work out. For instance in 1969 he wrote The Emerging Republican Majority, which made the (then) absurd prediction that white Southerners, who reflexively voted against the Republicans because of their role in the Civil War, were about to switch to the Republican party. Nixon followed the strategy that Phillips outlined, and they eventually did. In 1990, when Bush Sr's approval ratings were extremely high, Phillips wrote The Politics of Rich and Poor: Wealth and the American Electorate in the Reagan Aftermath, which argued that Bush was vulnerable. Clinton later said that he used it as his election plan.

So what absurd prediction did he make in Wealth and Democracy? Well, here it is as best as I remember it. In the last 400 years there have been three other examples of a country that was the sole global superpower. (Spain, the Netherlands, and England.) All three followed a remarkably similar trajectory. All had a major boom that was based more on financial speculation than on real production. During this bubble, actual production was outsourced to other countries. This bubble crashed. A couple of years later all three got into a foreign adventure that was supposed to be quick and easy. Said foreign adventure turned out to be much longer and more costly than predicted. Parallel to this there was a "recovery" in which average people didn't really get much ahead. During this period there was even more outsourcing of real production. That ended in a credit crisis some 7-8 years later. Following the credit crisis there was a paper recovery whose effects weren't widely felt. That was followed, not many years later, by a second financial crisis. After the second financial crisis there was growing outrage in the population, followed, almost exactly 15 years after the peak, by civil war in 2 of the cases. In the case of England there was civil unrest, but it was overwhelmed by WW I. The outcome was the establishment of a third major political party, Labour.

Incidentally, in all three cases one of the countries to which production and industrial knowledge had been transferred eventually wound up as the next superpower. Currently the best two candidates for that next superpower role are India and China.

Remember. This was written after the dot com bust, and before the Iraq invasion. Now compare with events of the last decade.

At first I thought, "He did a lot of work, and has a good track record, so I'll not discount it out of hand but I'll keep a lot of skepticism." Ever since then things have played out as predicted, almost like clockwork. I currently see lots of potential triggers for the second financial crisis. So I think that part is likely to play out, and my main hope is that we're more like England with civil unrest than Spain and the Netherlands with an actual civil war. However I have to say that the increasing polarization of the country over the last decade has made civil war much less unthinkable to me as a possibility than it was when I first read the book.

If the next financial crisis is triggered by a political stalemate, that makes me fear that civil war is more likely. Because the event will be one which arrives with already polarized constituencies, each of which has a well-articulated story for the crisis that blames the other for not being willing to (cut government|consider any kind of tax). As financial disaster is felt by all, how can tensions fail to escalate along already clearly drawn boundaries? As they escalate, where will it end?

Please, Congress, find a way to compromise. If you fail, trillions of dollars in excess interest payments is the best possible outcome we can hope for. None of us wants to see the worst. Really.

Wednesday, July 20, 2011

I've switched to Google+

I've been interacting on Google+. You can see my new posts there. The content is different, but includes some things which previously would have appeared here, such as this explanation of homophobia.

Therefore if you were following me here, you may want to try to follow me there instead.

Sunday, February 13, 2011

Finding related items

I've mentioned in the past (not in this blog) that C++ programs can beat databases by many, many orders of magnitude. I learned this while doing a specific fun project. I have permission from the company to discuss that project, and here is a description of that project.

I was working on a contract for ThisNext. Their database has a large number of records of different items, which a large number of people have tagged with arbitrary tags. Think millions of items, millions of distinct tags, and hundreds of millions of times that tags were applied to items. (Often the same tag has been applied to the same item multiple times.) Their idea was that this data could be used to discover which other items are somehow "related" to the current one, and possibly of interest if you find the current one interesting. This is used to populate related items, which is intentionally somewhat whimsical. For instance if you are interested in a pair of silk hotpants then perhaps you might be interested in velvet shorts, lumberjack hotpants or a Bo Peep Jumpsuit.

The suggestions are intentionally somewhat whimsical, and are generated by user activity. The idea behind them is that two items with tags in common are likely to be related. The more common the tag, the less related they are. For every item they wanted the 30 most related items, sorted in the order of how related they are.

There is an obvious computational problem that you will face. Suppose a popular tag was applied to 30,000 different items. Then each of those 30,000 items has to be compared with 30,000 other items, which gives you 900,000,000 relationships. This clearly is a scalability problem. So in their initial version they just counted the number of tags in common between items, and skipped any tag that had been used over 500 times.

This originally worked fairly well, but as time went by it slowly got worse. The first problem was that their program slowed down as they got more data. When I arrived it was taking a couple of hours to run and was affecting when they could start running backups. The second problem was that if you had an item that had only been tagged once with a popular tag, such as "sunglasses", then no suggestions would be generated for it. They wanted me to try to solve one or both, preferably both, problems.

My first step was to figure out the ideal result; only then would I try to optimize computing it. So I started by looking for a smooth function that worked over multiple orders of magnitude, and that captured the idea that a common tag indicates less of a relationship than an uncommon tag. The phrase "multiple orders of magnitude" made me think of logarithms, and so I decided that if a tag had been used n times for one item, m times for the other, and s times all told, then log(n + m)/log(s) would be the strength of that relation. The total weight of the connection between two items, I decided, should be the sum of the weights from the shared tags. I decided to break ties in favor of the most tagged other items. (Thinking that this was evidence of popularity, and popular items were probably better.) And any remaining ties in favor of the most recently created item. (Being tagged a lot in a short time is also evidence of popularity.)

Note that this formula is reasonable, but arbitrary. It was chosen not out of any knowledge of what the "right" answer is, but rather as the simplest reasonable formula that looked like it reflected all of the ideas that they wanted to capture.
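To make the formula concrete, here is a minimal sketch of it in C++. This is not the production code; the function and type names are mine, and it only illustrates the arithmetic described above.

#include <cmath>
#include <vector>

// Weight contributed by one shared tag.  n = times the tag was applied to
// the first item, m = times it was applied to the other item, s = times it
// was applied overall.  A shared tag implies n, m >= 1 and s >= 2, so both
// logarithms are positive.
double tag_weight(int n, int m, int s) {
    return std::log(n + m) / std::log(s);
}

// The total weight between two items is the sum over all of their shared tags.
struct SharedTag { int n, m, s; };

double total_weight(const std::vector<SharedTag>& shared) {
    double total = 0.0;
    for (const SharedTag& t : shared) total += tag_weight(t.n, t.m, t.s);
    return total;
}

For example, a tag applied once to each item but 1,000 times overall contributes log(2)/log(1000), roughly 0.10, while the same situation with a tag used only 10 times overall contributes log(2)/log(10), roughly 0.30. Rarer shared tags count for more, as intended.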

Next came finding a potentially efficient way to implement it. My idea was that I could do it with two passes. In the first pass I'd compare each item with up to the top 100 items within a tag, then sort and find the top 100 related items. In the second pass I'd figure out the full relationship between those items, and sort in the final order. This would prevent the quadratic performance problem that kept popular tags from being analyzed in the first implementation.

If you are familiar with databases then you may wonder how I was planning to pick the top 100 items from each tag, for all tags at once. If I had been using Oracle I could just have used analytic queries. That feature made it into the SQL 2003 standard, and so support for window functions is showing up in other major databases. But I wasn't using a database with support for that. I was using MySQL. However MySQL has an efficient hack for this problem which is described in http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/. (If you're doing a lot of work with MySQL, that hack is worth knowing about.) With that hack it is easy to have a query that will return the top 100 items for each tag.

So I coded up my program as a series of MySQL queries. When it came to actually processing the items, there was no way the database had the space to handle it in one big query, so I opted for a temp table into which I stuck ranges of items, which I'd then process in batches. The program ran, but it took 5 days to finish. This obviously was not going to be good enough. I thought about it long and hard, and decided that the best way to optimize would be to rewrite it in something other than SQL.

Normally I'd use Perl, but Perl wastes memory fairly liberally, so I was sure that it couldn't store this dataset in RAM. I am not a C++ programmer, but I noticed that C++ has vectors. Vectors support the same basic operations that I am used to in Perl, but (unlike Perl arrays) they are memory efficient and very fast. So I talked to the company, got permission to try rewriting this in C++, then went to work.

According to my memory, I used 4 data structures; a rough C++ sketch of them follows the list.
  1. item: A vector of structs. Each struct said how often the item had been tagged, where the item_tags started, and the number of item_tags there were. (The item_id was your position in the vector.)
  2. item_tag: A vector of structs. Each struct knew the item, the tag, and how many times that tag had been applied to that item. It was sorted by item_id then tag_id.
  3. tag: A vector of structs. Each struct knew the number of times that tag had been used, where the tag_items started, and the number of tag_items there were.
  4. tag_item: A vector of structs which is exactly like item_tag but this one was sorted by tag and then item.
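For concreteness, here is a rough C++ sketch of what those four structures could have looked like. The field names and integer widths are my guesses for illustration, not the originals.

#include <cstdint>
#include <vector>

// Flat, index-linked structures: an id doubles as a vector index, so no
// pointers or hash lookups are needed.

struct Item {                   // one entry per item, indexed by item_id
    uint32_t times_tagged;      // how often this item was tagged in total
    uint32_t item_tag_start;    // index of its first entry in item_tags
    uint32_t item_tag_count;    // how many entries it has in item_tags
};

struct ItemTag {                // one entry per (item, tag) pair
    uint32_t item_id;
    uint32_t tag_id;
    uint32_t count;             // times this tag was applied to this item
};

struct Tag {                    // one entry per tag, indexed by tag_id
    uint32_t times_used;        // how often this tag was used in total
    uint32_t tag_item_start;    // index of its first entry in tag_items
    uint32_t tag_item_count;    // how many entries it has in tag_items
};

std::vector<Item>    items;     // 1: item
std::vector<ItemTag> item_tags; // 2: item_tag, sorted by (item_id, tag_id)
std::vector<Tag>     tags;      // 3: tag
std::vector<ItemTag> tag_items; // 4: tag_item, sorted by (tag_id, item_id)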

Again I wrote my program in 2 passes exactly like before.

For each item the first pass was meant to identify which other items might possibly be the most related. Here it is in pseudocode:

foreach tag for that item (found in item_tag; the range comes from the item struct):
    from tag, figure out the range of tag_items I was interested in
        (the size of the range was a constant, other_items_limit)
    foreach other item found in tag_items:
        if item and the other item differ:
            save the existence of that relationship in relationships
                (a relationship contains the other item id, the weight, and
                 the popularity of the other item = how many times it was tagged)

sort relationships by other item id

aggregate relationships by other item id (aggregating weights across tags)

sort relationships by weight, times other item tagged, and other item id
    (all descending)

pick top other_items_limit items for the second pass
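Translated into rough C++, using the structure sketch and the tag_weight helper from above, the first pass for a single item might look something like this. The helper names and the handling of other_items_limit are illustrative, not the original code.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Candidate {
    uint32_t other_item;
    double   weight;
    uint32_t other_times_tagged;   // popularity of the other item
};

// First pass for one item: collect candidates via shared tags, aggregate
// weights per candidate, then keep the strongest other_items_limit of them.
std::vector<Candidate> first_pass(uint32_t item_id, uint32_t other_items_limit) {
    const Item& it = items[item_id];
    std::vector<Candidate> candidates;

    // Walk this item's tags.
    for (uint32_t i = it.item_tag_start; i < it.item_tag_start + it.item_tag_count; ++i) {
        const ItemTag& mine = item_tags[i];
        const Tag& tag = tags[mine.tag_id];

        // Only consider the first other_items_limit items under this tag.
        uint32_t end = tag.tag_item_start + std::min(tag.tag_item_count, other_items_limit);
        for (uint32_t j = tag.tag_item_start; j < end; ++j) {
            const ItemTag& theirs = tag_items[j];
            if (theirs.item_id == item_id) continue;
            candidates.push_back({theirs.item_id,
                                  tag_weight(mine.count, theirs.count, tag.times_used),
                                  items[theirs.item_id].times_tagged});
        }
    }

    // Aggregate weights for candidates reached through more than one tag.
    std::sort(candidates.begin(), candidates.end(),
              [](const Candidate& a, const Candidate& b) { return a.other_item < b.other_item; });
    std::vector<Candidate> merged;
    for (const Candidate& c : candidates) {
        if (!merged.empty() && merged.back().other_item == c.other_item)
            merged.back().weight += c.weight;
        else
            merged.push_back(c);
    }

    // Heaviest first; break ties by the other item's popularity, then by id.
    std::sort(merged.begin(), merged.end(), [](const Candidate& a, const Candidate& b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        if (a.other_times_tagged != b.other_times_tagged)
            return a.other_times_tagged > b.other_times_tagged;
        return a.other_item > b.other_item;
    });
    if (merged.size() > other_items_limit) merged.resize(other_items_limit);
    return merged;
}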

The purpose of the second pass is to find the final list of other items. Here it is in pseudocode:

foreach other item found in the first pass:
    weight := 0
    traverse item_tag in parallel for both items:
        if a common tag is found:
            add its weight to the calculated weight
    save the relationship

sort relationships by weight, times other item tagged, and other item id
    (all descending)

for the top 30 relationships:
    print this relationship to a file
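The parallel traversal in the second pass is essentially a merge join over the two items' item_tag ranges, which works because each range is sorted by tag id. Here is a rough sketch, again building on the structures sketched above rather than the original code.

// Full weight between two items: walk both items' tag lists in parallel and
// add the weight of every tag they have in common.
double full_weight(uint32_t item_a, uint32_t item_b) {
    const Item& a = items[item_a];
    const Item& b = items[item_b];
    uint32_t i = a.item_tag_start, i_end = a.item_tag_start + a.item_tag_count;
    uint32_t j = b.item_tag_start, j_end = b.item_tag_start + b.item_tag_count;

    double weight = 0.0;
    while (i < i_end && j < j_end) {
        if (item_tags[i].tag_id < item_tags[j].tag_id) {
            ++i;
        } else if (item_tags[j].tag_id < item_tags[i].tag_id) {
            ++j;
        } else {   // a shared tag
            weight += tag_weight(item_tags[i].count, item_tags[j].count,
                                 tags[item_tags[i].tag_id].times_used);
            ++i;
            ++j;
        }
    }
    return weight;
}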


Remember that the original SQL version took 5 days. The C++ program took under 20 minutes. I got rid of a details file that wasn't actually needed except for debugging, and I stored all of the logarithms that I needed in a vector so I could do an index lookup rather than an expensive floating point calculation. Total run time dropped to under 5 minutes. I left in various assertions that had been a development aid, and called it a job well done. Together with the steps to run queries, extract a dataset, run the program, and do the final upload, end to end I was generating the full result set in something like 15 minutes.
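The logarithm table is the simplest of those optimizations, so here is a hedged sketch of the idea; the function name and sizing are mine, not the original.

#include <cmath>
#include <cstddef>
#include <vector>

// Precompute log(1..max_count) once so the inner loops can replace a
// floating point log() call with a single vector index lookup.
// (Index 0 is unused padding, since log(0) is undefined.)
std::vector<double> build_log_table(std::size_t max_count) {
    std::vector<double> logs(max_count + 1, 0.0);
    for (std::size_t i = 1; i <= max_count; ++i)
        logs[i] = std::log(static_cast<double>(i));
    return logs;
}

// The per-tag weight then becomes two lookups and a division:
//   weight = logs[n + m] / logs[s];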

This astounded me. I had good reason to believe that C++ was going to be faster than the database. But 3 orders of magnitude faster? That I did not expect!

I believe that the biggest single difference is the work you need to do to access data.

What is the fastest way to get at an isolated piece of information in the database? You do an index lookup. Meaning that you do a binary search through a tree that then tells you what database row has your information and what page it is to be found on. (Don't forget that walking a binary tree means multiple comparisons, if/then checks, etc.) You then read that page of memory, parse through it to find the row you want, parse the information you want out of that row and return it. Note that the instructions for exactly what to do are likely represented in some sort of byte code that has to be interpreted on the fly.

In my C++ program the equivalent operation is, "Access vector by index." Is there really any surprise that this is several orders of magnitude faster?

There are other differences as well. In the database I was working hard to keep batches small enough that the temporary tables I was throwing around would fit in RAM. In C++ a good deal of the time the next piece of information I wanted would be in L1 or L2 cache inside of the CPU. (The fact that I was walking through data linearly helps here.) MySQL likes to represent numeric types in inefficient ways, while the C++ program uses faster native data types. MySQL had to calculate a lot of logarithms, which was work I mostly avoided in C++. These things all helped.

It is also possible that I had some fixable performance problem in my SQL solution. But in discussing this project I have encountered a number of other people who have had similar experiences. And the general order of magnitude of speedup that I encountered is apparently not particularly rare when going from SQL to C++.

However despite this experience, I remain a big proponent of pushing work to the database and not worrying about performance unless you need to. Why? Because I've pushed databases much harder than most people do. Only a handful of times have I hit their limits. And until I hit those limits, they do a lot to simplify my life, and speed development.

But I am now aware of exactly how many orders of magnitude faster I can make things by rewriting in C++, albeit with a giant corresponding increase in development time and in the testing required before you know you have correct results.

A final note. This project was fun for me, but what did ThisNext think? They were delighted! I solved their batch processing performance problem. In testing the suggestions produced were clearly better. And based on traffic and revenue figures following the release of my change, I was told that I had increased revenue by at least 10%.

Thursday, February 3, 2011

SQL formatting style

I have written considerably more SQL than most programmers have. And after you've written and maintained queries that run to several hundred lines, it becomes obvious that SQL is a programming language like any other. Which means that indentation helps comprehension for all the same reasons that it would in any other language. (I'm continually puzzled by programmers who fail to grasp this, and try to put SQL statements on one line.) But what indentation style should you use?

Here are the principles that lead to my current chosen style.

  1. Indentation should be in the range of 2-4 spaces. Research indicates that people can read and understand code better with an indent in that range than with one that is larger or smaller.
  2. When one thing depends on another, indentation should be increased. This makes the core statements obvious.
  3. When easy, line things up.
  4. In any list of things, it should be possible to extend the list by pasting new lines on without changing others. In a language like SQL where you can't have trailing commas, that means putting the comma at the start of the line.
  5. If you wrap, then operators belong at the front of lines. This makes the logical flow easy to see at a glance. (I got this one from Perl Best Practices.)
  6. SQL is usually case insensitive, so use underscores in field names to avoid ambiguous parses of things like ExpertsExchange and ExpertSexChange.

And here is an example of what this looks like in practice.

SELECT s.foo
     , t.bar
     , x.baz
     , CASE
         WHEN s.blat = 'Hello'
           THEN 'Greeting'
         WHEN s.blat = 'Goodbye'
           THEN 'Farewell'
         ELSE 'Unknown'
       END as blat_type
  FROM table1 s
  JOIN table2 t
    ON s.some_id = t.some_id
   AND s.some_type = 'wanted type here'
  JOIN (
        SELECT *
          FROM yet_another_table yat
         WHERE yat.reason = 'Subquery demonstration'
       ) AS x
    ON t.another_id = x.another_id
 WHERE s.blat = 'some'
   AND t.blot = 'condition'
;

If you've never thought about how to format SQL this style is likely to have a few surprises. But try maintaining a lot of SQL that has been formatted in this way, and you should find that it works out quite well.

Monday, January 3, 2011

How I do Proofs

I was looking through some old papers, and decided to put online some advice that I gave many years ago about how I tackle mathematical proofs. The advice is at How I Do Proofs. This is only lightly edited from the original that I handed out to a course I was teaching in the late 90s.

The course is the one I described at Teaching Linear Algebra. I never got any feedback on my request for improvements, but judging from how well that class did at proofs I believe that the advice was helpful. As one student I asked put it, "Doing proofs is like filling out a shopping list." That is how it should be in a first course where you are learning how to do proofs.

The class where I introduced this advice was fun. I started by handing out printouts of the advice. I then went through the flowchart at a high level. And then I put up a routine theorem from the text, which was the next thing I was supposed to do in the course. And I said, "OK, now that you've all learned how to prove things, you're going to prove this."

You should have seen the shock on their faces! Some started complaining. So I said, "No, seriously. You will all prove this. Just wait and see. It will work."

Then I began. I asked one person to say what the first step was. A second person to do it. A third person what the next step was. A fourth to do it. And so on. And every time they did something that could be written down, I wrote down exactly what they said. Before long they had proven a result that none of them thought they could prove!

The third of that night's homework which was on that day's material (if this statement puzzles you, go back to Teaching Linear Algebra and read about the homework strategy in that course) was very heavy on doing proofs. And I can tell that they referred to the handout because of one fact. The next day the grader came to me, very puzzled, and asked me to explain this strange piece of reasoning that everyone had used. It was the contrapositive, which is an interesting alternative to proof by contradiction. One of the problems could be solved either with contradiction or the contrapositive, and they all chose the contrapositive because I'd listed it before contradiction.

I considered this a very good thing. One of the major problems that students have with proof by contradiction is that it works too often. Proofs that only require a bit of routine algebra or computation can always be rewritten as proofs by contradiction. The rewritten proof is correct, but it is unnecessarily confusing, and it would have been better not to use contradiction. However because contradiction seems to always work, it becomes a sledgehammer that the student always uses, without noticing that it is sometimes the wrong tool.

The contrapositive doesn't have this drawback. It can solve the same problems as those that "really" needed proof by contradiction. But it doesn't lend itself to solving problems that never needed proof by contradiction. And so students who have been trained to try the contrapositive do not tend to make their simpler proofs over-complicated.

In fact when I drew up the advice I nearly left out contradiction. But it is so widely used that I figured I had to include it. However I tried to subtly discourage its use. And to the best of my knowledge I succeeded: I'm not aware that any students in my class ever used it in any of their proofs.