Tuesday, November 06, 2007

Technical - "Release It!"

This is another very nice book from the Pragmatic Programmers, written by Michael T. Nygard. "Release It!" is about putting software into production and what can happen to it there.

The book discusses a number of principles to consider when setting up large production systems. They are divided into anti-patterns (what not to do) and patterns (what to do). The two major parts of the book deal with stability (what to do to prevent systems from crashing) and capacity (how to design systems that will not crumple under load).

But perhaps the best part of the book is the examples of failures that occurred in the real world and how they could have been avoided. This is in the tradition of the best engineering books, which analyze failure in order to avoid it in the future.

Here is one example. The author had to track down a problem with a web-based system that would "crash" every morning because its connections to a backend database became "broken". Typically these sorts of systems have a pool of available connections that sit idle and are used when needed. The idle connections are kept because creating a new connection to a database server is slow.

Now, a connection to a database server means a TCP/IP socket connection. When a socket connection is established, the two end systems agree to communicate with each other, and the data between them may pass through other computers, routers, etc. along the way. However, when the connection is idle, no data is sent at all.

Now it turns out that there was a firewall between the web application and the database server (a typical setup). A firewall needs to keep track of all the connections that go through it. Firewalls are limited in how many concurrent connections they can track, so if they see a connection that is idle they will drop it after some suitable time interval (which could be hours of inactivity).
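The idle-timeout behaviour described above can be pictured with a toy model. This is only a sketch of the idea, not how any particular firewall is implemented: the `ToyFirewall` class, its method names, and the one-hour timeout are all made up for illustration.

```python
class ToyFirewall:
    """Hypothetical model of a firewall's connection table (illustration only)."""

    def __init__(self, idle_timeout_s):
        self.idle_timeout_s = idle_timeout_s
        self.last_seen = {}  # connection id -> time of the last packet seen

    def open(self, conn_id, now):
        # A new connection is set up through the firewall and gets a table entry.
        self.last_seen[conn_id] = now

    def send(self, conn_id, now):
        """Return True if a packet is forwarded, False if it is silently dropped."""
        last = self.last_seen.get(conn_id)
        if last is None or now - last > self.idle_timeout_s:
            # The entry has expired: the firewall has forgotten this connection,
            # and neither end is told about it.
            self.last_seen.pop(conn_id, None)
            return False
        self.last_seen[conn_id] = now  # traffic refreshes the idle timer
        return True


def demo():
    fw = ToyFirewall(idle_timeout_s=3600)  # assumed one-hour idle limit
    fw.open("pool-1", now=0)
    during_day = fw.send("pool-1", now=1800)                # used within the hour
    next_morning = fw.send("pool-1", now=1800 + 8 * 3600)   # idle all night
    return during_day, next_morning
```

In this model `demo()` forwards the daytime packet and drops the one sent after the quiet night, while the application still believes the pooled connection is open.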

It turned out that during the night there was not enough activity in the system to use the pooled connections to the database, so the firewall silently dropped them. Then, when people began to use the system in the morning, the connections to the database were suddenly gone and the system had to be restarted.

The solution was simple - database connections had to be kept active by periodically sending some data to the database server. But this is not the kind of thing that people think about while developing systems.
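Here is a minimal sketch of that idea, not the fix actually used in the book. The `KeepAlivePool` class and its parameters are hypothetical, and an in-memory `sqlite3` database stands in for the remote server so the example is self-contained; with a real networked database the "SELECT 1" ping would travel through the firewall and reset its idle timer.

```python
import sqlite3
import time


class KeepAlivePool:
    """Hypothetical pool that pings connections that have sat idle too long."""

    def __init__(self, connect, size, max_idle_s):
        self.connect = connect          # factory that returns a new connection
        self.max_idle_s = max_idle_s    # should be well below the firewall timeout
        self.idle = [(connect(), time.monotonic()) for _ in range(size)]

    def keep_alive(self, now=None):
        """Ping every connection idle for max_idle_s or more; return how many."""
        now = time.monotonic() if now is None else now
        refreshed = []
        pinged = 0
        for conn, last_used in self.idle:
            if now - last_used >= self.max_idle_s:
                try:
                    # A trivial query is enough to put traffic on the wire.
                    conn.execute("SELECT 1").fetchone()
                except sqlite3.Error:
                    # The connection is already dead: replace it with a new one.
                    conn = self.connect()
                pinged += 1
                last_used = now
            refreshed.append((conn, last_used))
        self.idle = refreshed
        return pinged


# Usage: call keep_alive() from a timer or background thread.
pool = KeepAlivePool(lambda: sqlite3.connect(":memory:"), size=2, max_idle_s=60)
pool.keep_alive()
```

The ping interval is the one design choice that matters here: it has to be shorter than the firewall's idle timeout, which the application team may need to ask the network team about.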

As with most of the Pragmatic Programmers books, I found this one very useful and entertaining.