Mmmm . . . flaky

You know you’re a tech geek when the most exciting thing that happened to you during the week is GMail’s new offline mode. What struck me most about the announcement, though, is not that they added offline mode, but that they added “flaky connection mode”. Flaky connection mode attempts to bridge the gap between online and offline. Maybe I’m online right this second, maybe I’m not, so GMail works in offline mode and attempts to quietly synchronize in the background. If I’m online, great, but if my connection bombed, no problem, it will just queue up my changes and try again soon.

As devices get smarter and more portable and more connectable, they’re also going to get flakier. Laptops and the various mobile platforms and smart devices continue to push the envelope of the possible, but they all need to deal with the fact that sometimes you’re on an airplane, sometimes you’re in a tunnel, and sometimes, well, your connection is just plain flaky.

Google has built the most powerful software platform on earth by embracing flakiness. When you perform a search, that request may be farmed out to 1,000 different servers, but the platform doesn’t for one nanosecond assume that all those requests are going to succeed. Failures are expected, and the platform compensates by knowing how to identify them (bad responses, no responses) and work around them (re-send the work to a box that isn’t misbehaving). Flaky connection mode simply extends this server-to-server model down to server-to-client. It’s a recognition that we’ve moved incredible computing power from the tightly controlled confines of the data center and put it out there in the flaky, flaky wild.

When you build internet-scale distributed systems, you should always assume you are in flaky connection mode. Maybe the tubes are down today. Maybe your vendor’s server went down. Even with all the contracts and SLAs and angry phone calls in the world, you fundamentally don’t have any control over that box staying up and reachable when you need it.

I’m working on an application right now that communicates with numerous internal and external web services, each owned by a different team or company. The business process I need to handle isn’t kicked off by a human, so I can’t ask a user to try again if I hit an error, and I can’t force distributed transactions across these systems, so starting over means duplicating lots of calls.

There were two ways for me to meet the business requirements. The first option was to demand fault-tolerant servers, redundant hardware and software, bug-free software from several different companies, and pixie dust to sprinkle on it all to ensure that nothing ever went down. I went with the second option: expect flakiness and deal with it. Using commodity hardware and software I was able to achieve a very high degree of fault/flake-tolerance as follows:

  1. Break the business workflow down into the most atomic units possible. First I need to acknowledge the job request from System A. Then I need to retrieve the data from System B. Then I need to send it to the vendor’s System C. Then I need to retrieve the response from System C and pass it to System D. And so on.
  2. Treat each of those workflow steps transactionally and durably. Either I have stored the job request from System A in my database and sent my acknowledgment of receipt and updated the job’s status in my database, or I must assume that the step was not begun at all.
  3. Deal with the fact that #2 guarantees I will never miss a step, but it may occasionally cause me to execute it twice. Maybe I did store the job request from System A and send my acknowledgment, but I blew up at the very end when I updated the job’s status. If I can’t make that step idempotent, it’s my responsibility to deal with the consequences of duplicate runs. The more atomic I make the steps, the easier this is to do. (A rough sketch of this step-plus-status-update pattern follows the list.)
  4. Have my system understand the difference between system errors and business errors. Because my app is mainly hitting HTTP REST services, this is relatively easy (there’s a sketch of the classification below). 400-series errors mean the client (i.e. me) is wrong, and the job is doomed, so I abort it. 500-series errors, garbage responses, or simply no response at all tell me that the server is temporarily acting flaky and I should try again in a little while when I . . .
  5. Have a sweeper process that runs in the background and calls the next step on any job that hasn’t been updated in, say, 15 minutes. No matter where a job died, all this process needs to know is to call the job’s CallNextStep method and it will resume after the last completed step. (A sketch of the sweeper follows the list as well.)
  6. Recognize that sometimes, success isn’t in the cards. Let’s say that after 20 tries over 12 hours, I’m going to raise a notification to our operations people that this one was just too flaky for me and they need to resolve it manually.
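
To make steps 2 and 3 concrete, here is a minimal sketch of the step-plus-status-update pattern in Python, assuming a relational jobs table. The step names, schema, and helper functions are hypothetical stand-ins for illustration, not the actual system:

    import sqlite3

    # Hypothetical schema: each job row records the index of its last completed
    # step. The real steps and tables will differ; this only shows the pattern
    # of doing a step's work and recording it durably in one place.
    STEPS = ["acknowledge_request", "retrieve_data", "send_to_vendor", "forward_response"]

    def execute_next_step(conn: sqlite3.Connection, job_id: int) -> None:
        row = conn.execute(
            "SELECT last_completed_step FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        last_done = row[0]  # index into STEPS, or -1 if nothing has been done yet
        if last_done >= len(STEPS) - 1:
            return  # already finished; running again is a harmless no-op

        with conn:  # one transaction: do the work AND record that it was done
            do_step(STEPS[last_done + 1], job_id)   # the call out to System A/B/C/D
            conn.execute(
                "UPDATE jobs SET last_completed_step = ?, "
                "updated_at = datetime('now') WHERE id = ?",
                (last_done + 1, job_id),
            )
        # If do_step succeeded but the commit failed, the step will simply run
        # again later -- which is exactly why each step has to be idempotent or
        # at least safe to repeat.

    def do_step(step_name: str, job_id: int) -> None:
        ...  # the actual call to the external system goes here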
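
The system-versus-business-error split in step 4 maps neatly onto HTTP status codes. A sketch using the Python requests library; the exception names and the 30-second timeout are my own invention:

    import requests

    class BusinessError(Exception):
        """The request itself is wrong (4xx): retrying will never help, abort the job."""

    class FlakyError(Exception):
        """The server is misbehaving (5xx, garbage, or no response): try again later."""

    def call_service(url: str) -> dict:
        try:
            response = requests.get(url, timeout=30)
        except requests.exceptions.RequestException as exc:
            raise FlakyError(f"no response at all from {url}") from exc

        if 400 <= response.status_code < 500:
            raise BusinessError(f"{url} says the request is bad: {response.status_code}")
        if response.status_code >= 500:
            raise FlakyError(f"{url} returned {response.status_code}")

        try:
            return response.json()
        except ValueError as exc:
            raise FlakyError(f"garbage response from {url}") from exc

A caller aborts the job on BusinessError and leaves it alone on FlakyError, so the sweeper can pick it up again later.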
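
And a sketch of the sweeper from steps 5 and 6, again with hypothetical table and function names (call_next_step here stands in for the CallNextStep method described above):

    import sqlite3
    import time
    from datetime import datetime, timedelta

    STALE_AFTER = timedelta(minutes=15)   # "hasn't been updated in, say, 15 minutes"
    MAX_ATTEMPTS = 20                     # after this, hand the job to a human

    def call_next_step(job_id: int) -> None:
        ...  # stands in for the job's CallNextStep method

    def notify_operations(job_id: int) -> None:
        ...  # raise an alert so someone can resolve the job manually

    def sweep(conn: sqlite3.Connection) -> None:
        cutoff = (datetime.utcnow() - STALE_AFTER).strftime("%Y-%m-%d %H:%M:%S")
        stale_jobs = conn.execute(
            "SELECT id, attempts FROM jobs "
            "WHERE status = 'in_progress' AND updated_at < ?",
            (cutoff,),
        ).fetchall()

        for job_id, attempts in stale_jobs:
            if attempts >= MAX_ATTEMPTS:
                notify_operations(job_id)
                conn.execute(
                    "UPDATE jobs SET status = 'needs_attention' WHERE id = ?", (job_id,)
                )
            else:
                try:
                    call_next_step(job_id)   # resumes after the last completed step
                except Exception:
                    pass                     # still flaky; the next sweep will retry
                conn.execute(
                    "UPDATE jobs SET attempts = attempts + 1 WHERE id = ?", (job_id,)
                )
            conn.commit()

    def run_forever(conn: sqlite3.Connection) -> None:
        while True:
            sweep(conn)
            time.sleep(60)   # stale jobs get retried within roughly 15 minutes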

My application will never have to worry about the machine on which it’s running going into a metal elevator, the neighbor changing his wi-fi password from “linksys”, or the complexity of running on a 1,000-node Google cluster. But the principles of flaky connection mode apply to any application with any degree of inter-system dependency. Designing for flakiness allows you to build highly reliable, fault-tolerant systems out of very flaky parts.