Mission Critical

Jul 28 09

Paul Weinstein

Over the past week plenty of commentary
has been made, my own included, on the 40th
anniversary of the historic flight of Apollo 11. One comment on
Twitter by alexr, however, caught my attention: “SW
engineers should take this moment to consider if they’d trust their
code to have gotten the LM onto the lunar surface safely.”

Indeed a sobering thought at first
glance. In this day and age it is quite common to run into a program,
written by one or more software engineers, that seems unstable or
error-prone. What if all those engineers had to deal with the rigors
of getting their code “flight ready”, with billions of dollars
and at least two men’s lives at risk?

But on second thought I think this
comment does more harm than good. It seems to imply either that
the challenges of writing critically important, life-and-death
software occur only once in a blue moon [1],
or that the good old days of elite superman programmers who wrote
error-free programs are long gone, replaced by thousands of
mediocre programmers writing millions of lines of bug-infested code.

Life-or-death situations
directed by computers (hardware and software) might not be an
everyday occurrence for a programmer, or even a once-in-a-career
occurrence. But they do still occur. I recall the professor who
taught my Assembly Language class in college mentioning his work on a
project for Motorola on a car fuel-injection system. The engine had
a habit of shutting down in the presence of electrical interference.

Just imagine driving down a highway at
55 mph only to have your car shut down while passing some
high-tension power lines... Now consider the added complexity of
today’s hybrid engines.

And while not every programming
challenge is “life-and-death”, plenty of software in today’s world is “mission critical”, with millions, if not
billions, of dollars at stake [2].

In any case, the Lunar Module’s
software was hardly error-free. In fact, during the Apollo 11
moon landing two specific incidents occurred with the Eagle’s
Guidance Computer during the critical descent to the moon’s surface.

At 102:38:30 Neil Armstrong calls out a program alarm, “1202”. Ten seconds later,
Armstrong is asking for feedback from Houston on the error [3].
Houston gives the astronauts a “go” to continue their descent.
But less than 5 minutes later, with 2000 feet separating the LM from
the surface, the ship’s computer issues a “1201”.

102:42:13 Armstrong: (on-board): Okay. 3000 at 70. 

102:42:17 Aldrin: Roger. Understand. Go for landing. 3000 feet.

102:42:19 Duke: Copy.

102:42:19 Aldrin: Program Alarm. (Pause) 1201

102:42:24 Armstrong: 1201. (Pause) (On-board) Okay, 2000 at 50.

102:42:25 Duke: Roger. 1201 alarm. (Pause) We’re Go. Same type. We’re Go.

Second round of system issues.

What were the 1201 and 1202 errors?
Only that the Apollo Guidance Computer was indicating that it was
overloaded with data inputs, couldn’t keep up and was resetting.

Yep, that’s right: the guidance
computer for the LM rebooted, at least twice, during one of the most critical phases of the mission because
it ran out of memory.

The problem? An error in one of the
crew’s checklists had them turn on the rendezvous radar during the
landing phase. Of course the LM crew was hardly trying to rendezvous
with the Command Module during their descent, but the repeated calls
to the computer to process imaginary rendezvous radar data filled up
the limited writable memory [4]
the on-board system had, causing the system to repeatedly restart.

Now I suppose somebody will argue that
the computer was hardly to blame. Turning on the rendezvous radar
was a user-generated error (or a documentation error), not a
computer programmer error. Moreover, the program was designed, on
purpose, to reset itself if it got overloaded [5].

But that’s just it. No programmer, no
matter how good, can take into account every possible error or
misuse, whether created by the programmer or the user [6]. Would you have
considered at first that an overhead power line might scramble your
car’s fuel injection system?

This is where the concept
of fault-tolerant programming comes into play. The idea is pretty
basic: enable the system to continue operating properly in the event
of an error. Just as the Apollo spacecraft (and Saturn V
launcher) had mechanical backups to keep the physical system running
in case of failure, the guidance program (like properly designed
programs of today) managed the error and kept it from causing
catastrophic results.
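
To make the idea a little more concrete, here is a minimal sketch in Python of the "restart and keep only what matters" approach. It is an illustration only, not how the Apollo Guidance Computer was actually programmed: the names `Executive`, `Overload`, `submit` and `simulate`, the four-slot capacity and the job names are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


class Overload(Exception):
    """Raised when no job slot is free (a made-up stand-in for a program alarm)."""


@dataclass
class Executive:
    capacity: int                                # fixed number of job slots (assumed)
    jobs: list = field(default_factory=list)     # (name, critical) pairs
    restarts: int = 0

    def schedule(self, name: str, critical: bool) -> None:
        # Refuse new work once every slot is taken.
        if len(self.jobs) >= self.capacity:
            raise Overload(name)
        self.jobs.append((name, critical))

    def restart(self) -> None:
        # Clear the fault: drop everything except the critical jobs.
        self.jobs = [job for job in self.jobs if job[1]]
        self.restarts += 1

    def submit(self, name: str, critical: bool) -> None:
        # Fault-tolerant entry point: an overload triggers a restart
        # instead of halting the whole system.
        try:
            self.schedule(name, critical)
        except Overload:
            self.restart()
            if critical:
                # Critical work is retried after the restart. If even the
                # critical jobs do not fit, the overload propagates.
                self.schedule(name, critical)
            # A non-critical job that caused the overload is simply dropped.


def simulate() -> Executive:
    exe = Executive(capacity=4)
    exe.submit("landing guidance", critical=True)
    exe.submit("display update", critical=True)
    # A stream of unneeded requests, loosely like the stray radar data.
    for i in range(6):
        exe.submit(f"rendezvous radar #{i}", critical=False)
    return exe


if __name__ == "__main__":
    exe = simulate()
    print(exe.restarts, [name for name, _ in exe.jobs])
```

The design choice worth noticing is that the overload is treated as an expected condition with a planned recovery path, rather than as something that can never happen. The unneeded jobs are thrown away and the critical ones survive each restart.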

Thus the question should not be, “Engineers, would you trust your code to get the LM to
and from the moon safely?” Instead it is, “Do you consider your
software fault-tolerant enough to get one to the moon and back?”

An interesting side note: there is a community programming effort that has created a software emulator of the Apollo hardware and software, Virtual AGC and AGS.

[1] Pardon the pun.

[2] And, by indirect implication, the lives of the employees, customers, stockholders,
et al.

[3] Sounds eerily familiar to any modern-day computer user: the computer reports some
cryptic error code and the next step is to go searching for
additional information on what’s gone wrong.

[4] Nowadays we classify memory as Read-Only Memory (ROM) or Random Access
Memory (RAM) and talk about gigabytes (10^9 bytes) of RAM for a laptop (or even a
smartphone). The Apollo Guidance Computer? About 36K 16-bit words of ROM (roughly
72 kilobytes, where a kilobyte is 10^3 bytes) and only 2K words (roughly 4 kilobytes) of writable RAM.

[5] The idea was to clear the fault and reestablish important tasks, i.e.
clear out the waiting calls for calculating unnecessary rendezvous
telemetry and reestablish jobs for processing landing telemetry.

[6] And just in case you wish to insist that the programmers of yesteryear were supermen: it turns out one uncorrected bug could have crashed the LM by trying to fly the craft first “under” the surface, then back “over” the surface and then “onto” the surface for a safe landing.