A Prayer for Distributed Systems Developers

Over 16 years, I’ve written software up and down the entire stack. Early in my career I wrote boot ROM software for specialized embedded devices. This kind of programming taught me a great deal about how computers really work.

Most programmers think they know how computers work, when they really only know how software works. These programmers don’t really know how an Interrupt Service Routine is invoked, how a computer actually adds two numbers together, or how a microprocessor fetches the next instruction from memory. But that’s OK: they don’t need to, nor do they care.

So for many years I wrote embedded software but, like a balloon caught in an updraft, found myself floating up the software stack. I hung out in kernel-space for a while, then camped out for a few years in user-space, and most recently found myself atop an enormous stack of software. But just writing embedded software teaches one little about systems, and looking back down such a tall stack might cause the mind to reel. Don’t look down, as they say.

But standing on top of that stack, and looking down, has got me thinking about the two types — and possibly a third — of software engineering I’ve seen along the way.

Type 1 Software Engineering

Let’s call what I am about to describe Type 1 Software Engineering.

Embedded software is an example of this. Because the input conditions are so bounded, it is much easier to show that the software is correct, to a very high degree. Being so close to the hardware, it benefits from components whose behavior has been precisely measured. During my career I’ve written thousands of assembly instructions, which read from, and wrote to, CPU registers. Those instructions always executed in exactly the same amount of time (technically the same number of clock cycles). The idea of monitoring the responsiveness of a read-modify-write cycle is obviously nonsense from a programmer’s point of view (though not so for the microprocessor designer!).

When an embedded system is running, even a larger one like a digital video recorder, it feels somewhat mathematical. Signals go in, signals come out. Buttons are pressed, displays change. It’s constrained, and it leads to a certain mindset. The challenges are still significant. The systems are often physically very small, the hardware often highly customized, and the tools relatively few. Sometimes one has to debug the software using an oscilloscope, something I have done.

But the programming problem, the product, often reaches a point of completion, a time when you can say: that system is finished, and I know how it will behave in all circumstances.

Type 2 Software Engineering

Now let’s talk about what I call Type 2.

Type 2 involves programming, and designing, large distributed systems. Function calls may involve multiple network hops, over cross-country connections. Latencies are scattered throughout the components: calls to external services may take microseconds, or they may take minutes. Sub-systems may be close to failure, but appear to be fine. This type of software design is much more prevalent today, with the advent of easy access to large-scale computer systems.

In my experience there is only one way to engineer these systems. It’s to take all you’ve learned from programming, and then become a Systems Engineer. Every attribute must be measured and monitored, every action timed and recorded. You must learn about feedback loops. It is much more akin to real-world mechanical engineering. In this model the operating parameters of the software are chosen, and often continually changed, by paying attention to how the system is performing and modifying the parameters as needed. Coupling a solid design to on-going measurement and analysis, and then feeding that back into the operation of the system: this is what I think of as Type 2 Software Engineering.
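To make the measure-then-adjust idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions of my own choosing: the names `LatencyRecorder` and `next_timeout`, the sliding window size, and the headroom, gain, floor, and ceiling values are all hypothetical, not taken from any particular system. It times every call, keeps a window of recent durations, and nudges one operating parameter (a timeout) toward what the measurements suggest.

```python
import time
from collections import deque


class LatencyRecorder:
    """Keeps a sliding window of recent call durations, in seconds."""

    def __init__(self, window=100):
        # Hypothetical window size; old samples fall off automatically.
        self.samples = deque(maxlen=window)

    def timed(self, fn, *args, **kwargs):
        """Call fn and record how long it took, even if it raises."""
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            self.samples.append(time.monotonic() - start)

    def percentile(self, p):
        """Nearest-rank percentile (p is an integer 1..100); None if empty."""
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        # Integer ceiling of p * n / 100, so no floating-point rounding.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]


def next_timeout(current, observed_p99, headroom=2.0, gain=0.5,
                 floor=0.05, ceiling=30.0):
    """Move the timeout part of the way toward headroom * observed p99.

    All defaults are illustrative assumptions. The gain below 1.0 damps
    the loop so a single noisy window cannot swing the parameter wildly,
    and the floor and ceiling bound it.
    """
    target = headroom * observed_p99
    adjusted = current + gain * (target - current)
    return min(ceiling, max(floor, adjusted))
```

In use, every outbound call would go through `recorder.timed(...)`, and a periodic task would compute `recorder.percentile(99)` and feed it to `next_timeout`. The point of the sketch is the shape of the loop (measure, compare, adjust a little, repeat), not the specific numbers.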

Ideally, Type 2 is a superset of Type 1. However, Type 2 Engineering breaks down when operational characteristics are the only driver of change, without any attempt to understand why the system is performing the way it is. Call this Type 3 Software Engineering or, more succinctly, guessing. This often happens when time is short, when a team is under stress, or when the system is close to failure.

A Serenity Prayer for Distributed Systems developers, perhaps?

God, grant me the design to deal with the things I cannot change,

The code to change the things I can,

And metrics to know the difference.

Philip O'Toole