Error Handling For Fun And Profit

(The profit is that your code works.)

It was my second permanent job that taught me how to handle errors. It was my third job that taught me I needed to do that all the time.

One of the lengthier tasks I had at the second company I worked for was to fix someone else’s error. There had been a joint project with another large company (the task I had actually been hired for), in the course of which a large number of circuit schematics had been shared. These schematics came in the form of plotter files in an early version of the HPGL format. It was a very simple format: two-letter commands were introduced with a dot, and everything was printable ASCII except for the ETX character (0x03, or Ctrl-C to you) that terminated the text of labels.

Corporate politics being corporate politics, the joint project eventually fell apart with the teams concerned being told very firmly that they were to have nothing more to do with “the enemy”. It was at this point that my employer discovered that the person who had FTPed the schematics over to our site had forgotten to set binary mode on the transfers. All of those ETX characters had been turned into, you guessed it, dots. Since repeating the transfers was no longer an option, I was given the job of turning only the right dots back into ETX characters.

The obvious social engineering solution was out; while I got on well enough with the engineers at the other company and their line managers might turn a blind eye to us going down the pub and just happening to swap floppy discs, I had no way of getting in contact with them. We were in towns inconveniently far apart, I didn’t have any of their phone numbers, and email was a laughable idea back then. So I had to write the code to guess which dots shouldn’t have been dots and change them back, which didn’t take very long, then tune it and send every single one of the fixed files to the big plotter to prove they were in fact fixed, which did.

In other words, I learned how to detect a particular inobvious error condition and deal with it.

In a lot of ways my thinking about errors stalled there for a while. Errors were what happened when you did something wrong, and the obviously correct thing to do was to not make mistakes in the first place. More realistically, you fixed the errors you found and re-issued the software. It never occurred to me that my underlying principle was just plain wrong.

The third company I worked for corrected that idea almost as a by-product. They were a networking company (and some of the same mad people now work for Kynesim), and I learned very quickly that network communication is not reliable. Interference, dodgy timing, sunspots (no, I’m not kidding), earth-level mismatches or any of a number of other things could induce bit errors or lose signals entirely, and we had to detect and deal with that. I learned about protocols, basic error detection techniques and what to do with the results. More to the point, I learned that this was a general principle, and to apply it everywhere.

Memory is finite. So if you allocate yourself some dynamic memory, always check that you got it, and decide what to do if you didn’t. If that means giving up on the program, do so gracefully rather than collapsing in an unforgiving heap. The fact that Linux kernels have for a long time lied to your face about whether or not they have given you real memory (the default overcommit behaviour) does not endear kernel engineers to me, and does tend to make me think of them as computer scientists rather than software engineers. (The paper “When malloc() Never Returns NULL”, at https://arxiv.org/pdf/2208.08484, explains this situation in more depth and argues that I am wrong to try to fail gracefully. I disagree, obviously.)
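
For the sake of something concrete, here is roughly what that looks like in C. This is only a sketch: copy_buffer() and store_label() are names I have made up for illustration, and "log it and drop the label" is just one possible decision about what to do on failure.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: make a NUL-terminated copy of len bytes from src.
 * Returns NULL if the memory could not be allocated. */
char *copy_buffer(const char *src, size_t len)
{
    /* len + 1 would wrap round to zero here, so refuse up front. */
    if (len == SIZE_MAX) {
        return NULL;
    }

    char *dst = malloc(len + 1);
    if (dst == NULL) {
        return NULL;            /* let the caller decide what to do */
    }

    memcpy(dst, src, len);
    dst[len] = '\0';
    return dst;
}

/* Hypothetical caller: this is where the decision gets made.
 * Returns 0 on success, -1 if the label had to be dropped. */
int store_label(const char *text, char **out)
{
    char *copy = copy_buffer(text, strlen(text));
    if (copy == NULL) {
        fprintf(stderr, "store_label: out of memory, label dropped\n");
        return -1;
    }

    *out = copy;                /* caller now owns the copy and must free() it */
    return 0;
}
```

On a Linux box with overcommit enabled the NULL branch may rarely if ever fire, but the check costs next to nothing and the code still behaves itself on systems that tell the truth.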

If you call a library function, always check that it doesn’t return an error and decide what to do if it does. Even on the functions that can’t possibly return errors. Heck, especially on those functions. I learned the hard way that “impossible” errors will happen when it’s most inconvenient, and then sit around quietly laughing at you as you spend days trying to reproduce them.
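
A small illustration, again in C and again only a sketch. snprintf() is a standard function that many people treat as unable to fail, but it returns a negative value on an encoding error, and a value at least as large as the buffer size if the output did not fit. The wrapper name format_setting() and its return codes are my own invention for the example.

```c
#include <stdio.h>

/* Hypothetical helper: write "name=value" into buf.
 * Returns  0 on success,
 *         -1 if snprintf() itself reported an error,
 *         -2 if the result did not fit and was truncated. */
int format_setting(char *buf, size_t size, const char *name, int value)
{
    int n = snprintf(buf, size, "%s=%d", name, value);

    if (n < 0) {
        return -1;              /* the "impossible" failure */
    }
    if ((size_t)n >= size) {
        return -2;              /* output truncated: buf is not what we wanted */
    }
    return 0;
}
```

The caller still has to decide what -1 and -2 mean for it, but at least it now gets asked.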

When you parse the protocol that you wrote yourself as it comes in over the wire, always check that it makes sense and decide what to do with all the different possible malformations. Even if no transmission errors are likely to make it through the underlying layers, the sender may have misimplemented the protocol. Worse, your code may be running on something out in the field and the protocol may have been updated in the meantime. You have to figure out what to do about that. (Which leads to the first law of protocol design: always include a version number. Every time I have flouted this rule, I have paid for it.)
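
To make that less abstract, here is a sketch in C of a parser for an entirely hypothetical message format: one byte of version, one byte of message type, a two-byte big-endian payload length, then the payload. The format, the constants and every name in it are assumptions invented for this example, not any real protocol.

```c
#include <stddef.h>
#include <stdint.h>

/* All of these values are made up for the example. */
#define PROTO_VERSION 2u        /* newest version this code understands */
#define HEADER_SIZE   4u
#define MAX_PAYLOAD   1024u
#define MSG_DATA      1u
#define MSG_ACK       2u

typedef enum {
    PARSE_OK = 0,
    PARSE_TOO_SHORT,            /* not even a whole header */
    PARSE_UNSUPPORTED_VERSION,  /* sender is newer than we are, or broken */
    PARSE_UNKNOWN_TYPE,
    PARSE_BAD_LENGTH            /* length field disagrees with what arrived */
} parse_result;

typedef struct {
    uint8_t version;
    uint8_t type;
    uint16_t length;
    const uint8_t *payload;     /* points into the caller's buffer */
} message;

parse_result parse_message(const uint8_t *buf, size_t len, message *out)
{
    if (len < HEADER_SIZE) {
        return PARSE_TOO_SHORT;
    }

    /* Check the version first: nothing after it can be trusted otherwise. */
    uint8_t version = buf[0];
    if (version == 0 || version > PROTO_VERSION) {
        return PARSE_UNSUPPORTED_VERSION;
    }

    uint8_t type = buf[1];
    if (type != MSG_DATA && type != MSG_ACK) {
        return PARSE_UNKNOWN_TYPE;
    }

    uint16_t length = (uint16_t)(((unsigned)buf[2] << 8) | buf[3]);
    if (length > MAX_PAYLOAD || (size_t)length != len - HEADER_SIZE) {
        return PARSE_BAD_LENGTH;
    }

    out->version = version;
    out->type = type;
    out->length = length;
    out->payload = buf + HEADER_SIZE;
    return PARSE_OK;
}
```

Each malformation gets its own result code so that the caller can choose a different response to each: drop the packet, ask for a resend, or tell the far end that it is speaking a version we do not understand.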

Always, always, always check your return codes for errors, and decide what to do with them. Always. Because they will happen.

(How many of you noticed the error in the cover image?)
