Reliability First - Spacecraft

Artistic image of Rosetta, Philae and
comet 67P/Churyumov–Gerasimenko
I bet you've heard that last week, for the first time, a human artifact landed on a comet (named 67P/Churyumov–Gerasimenko). The lander Philae and its companion, the space probe Rosetta of the ESA (European Space Agency), have done long and great work. On this page you can see a summary of their ten-year journey through the Solar System.

But it was not a bed of roses. The mission had its troubles, from the delayed launch to the less-than-perfect landing of Philae. There were some technical issues, but Rosetta was reliable enough to accomplish its duty.

There is a lesson that a developer can learn from this story: build your software as if it had to survive ten years in space without maintenance. Check every possible failure case and make it work even when the situation is not perfect.

Of course, this is a reminder to me too, since too often I assume the system will never run out of memory or disk space.

Image by European Space Agency, licensed under Creative Commons BY-SA 2.0

Reliability First - Embedded Systems

A well-designed watchdog
This post could end with a single word: watchdog. But designing a good watchdog is a challenging task.

A hardware chip that cuts the power supply to the main processor is indispensable for real reliability. This chip must be pinged at regular intervals; if the pings stop, it power-cycles the processor. If well calibrated, this system can be effective enough for single-threaded applications running on microcontrollers. But for microprocessors with an operating system and several processes running, a software watchdog is needed too.

The Software Side

Obviously the watchdog process (WDP from now on) must be tied to its hardware counterpart: the WDP is the one that pings the chip. This way, if the WDP crashes, the system will reboot, ensuring that the other processes are never left without a monitor. This is the easy part; how the WDP checks that everything is working fine is another kettle of fish.
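To make the link between the two watchdogs concrete, here is a minimal sketch. It assumes a Linux-style watchdog device, where any write counts as a ping; the function name service_hardware_watchdog and the idea of passing health checks as callables are my own illustration, not a standard API.

```python
import io
from typing import BinaryIO, Callable, Iterable


def service_hardware_watchdog(device: BinaryIO,
                              checks: Iterable[Callable[[], bool]]) -> bool:
    """Ping the hardware watchdog only if every health check passes.

    `device` is assumed to behave like a Linux watchdog device node,
    where any write is treated as a keep-alive.
    """
    if all(check() for check in checks):
        device.write(b"\0")
        device.flush()
        return True
    # Stay silent on purpose: the hardware timer expires and the chip
    # power-cycles the board. The same happens if the WDP itself dies.
    return False


# Stand-in for the real device, so the sketch runs anywhere.
fake_device = io.BytesIO()
service_hardware_watchdog(fake_device, [lambda: True])
```

On a real board the device would be opened once at startup, for example with open("/dev/watchdog", "wb", buffering=0), and this function would be called from the WDP main loop at an interval shorter than the hardware timeout.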

One solution is to monitor the status of every process (ensuring that it's running and not a zombie) and to watch for abnormal usage of CPU and RAM. The hard part here is defining what "abnormal" means.
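As an illustration of the status check, here is a small sketch that assumes a Linux system, where the state of a process can be read from /proc/<pid>/stat. The helper names parse_proc_state and is_healthy are made up for this example; the CPU and RAM thresholds are left out because they depend entirely on the application.

```python
from pathlib import Path


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so cut after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]


def is_healthy(pid: int) -> bool:
    """True if the process exists and is neither a zombie (Z) nor dead (X)."""
    try:
        stat_line = Path(f"/proc/{pid}/stat").read_text()
    except (FileNotFoundError, ProcessLookupError):
        return False  # the process is gone
    return parse_proc_state(stat_line) not in ("Z", "X")
```

The WDP could call is_healthy for every PID it is responsible for on each monitoring cycle.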

Besides this rough check, we can make each process feed the WDP at regular intervals. The drawback is that we need to complicate each process by inserting code not related to its core business. If that doesn't seem like a big disadvantage, try to imagine the amount of code needed for a big process with tens of threads running concurrently. Unfortunately, this is the price to pay for a really reliable system.
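The bookkeeping on the WDP side of this feeding scheme could look roughly like the sketch below. The class HeartbeatMonitor and its methods are hypothetical names, and the transport that carries the feed from each process to the WDP (socket, pipe, shared memory) is deliberately left out.

```python
import time


class HeartbeatMonitor:
    """Tracks when each registered process (or thread) last fed the WDP."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the logic can be tested
        self._timeout = {}       # name -> maximum allowed silence, seconds
        self._last_feed = {}     # name -> time of the last feed

    def register(self, name: str, timeout_s: float) -> None:
        self._timeout[name] = timeout_s
        self._last_feed[name] = self._clock()

    def feed(self, name: str) -> None:
        if name not in self._timeout:
            raise KeyError(f"unknown process: {name}")
        self._last_feed[name] = self._clock()

    def expired(self) -> list:
        """Names that have been silent for longer than their timeout."""
        now = self._clock()
        return sorted(name for name, last in self._last_feed.items()
                      if now - last > self._timeout[name])
```

A process with many threads would register each thread under its own name, which is exactly where the extra code mentioned above comes from.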

Management Of The Failure

OK, now you have your WDP running on your system with other processes that feed it. The next step is to decide what to do in case of failure. If a program is about to consume all the system resources, the obvious thing to do is to kill it. And then?

The answer depends on the process and the system architecture. For some processes, the right solution may be to restart them; for others, a system reboot may be required. Additional rules may be set on the number of failures within a certain time. In an average system, all these strategies will probably be applied, each to different processes.
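One possible shape for such a rule is sketched below: restart the process while failures are rare, and escalate to a reboot when too many happen within a time window. The class name FailurePolicy, the two action strings and the thresholds are assumptions made for the example.

```python
from collections import deque


class FailurePolicy:
    """Restart a process until it fails too often, then ask for a reboot."""

    def __init__(self, max_failures: int, window_s: float):
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()  # timestamps of recent failures

    def on_failure(self, now: float) -> str:
        self._failures.append(now)
        # Forget failures that fell out of the time window.
        while now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) <= self.max_failures:
            return "restart"
        return "reboot"
```

Each monitored process would get its own policy instance, so a critical process can use max_failures=0 (always reboot) while a less important one is allowed several restarts.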

Reports

After the WDP has done its dirty work, it is likely that the failure will reappear. The cause can be a bug, an unhandled situation or a memory leak that is slowly consuming the RAM. A good way to understand what happened is to take a memory dump of the "bad" process for post-mortem debugging. Unfortunately, this is often not enough.

In a complex system where processes interact, a log that shows information from the last minutes before the WDP intervention can be really useful. This means yet more extra code added to the processes.
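A "last minutes" log can be kept cheaply in memory and written out only when the WDP steps in. The sketch below is one way to do it; the class RecentLog, its default sizes and the decision to keep the entries in RAM are assumptions for illustration.

```python
import time
from collections import deque


class RecentLog:
    """Keeps only the most recent log entries, bounded in age and count."""

    def __init__(self, keep_s=300.0, max_entries=10000, clock=time.monotonic):
        self._keep_s = keep_s
        self._clock = clock
        # maxlen caps memory use even if a process logs in a tight loop.
        self._entries = deque(maxlen=max_entries)

    def _drop_old(self, now: float) -> None:
        while self._entries and now - self._entries[0][0] > self._keep_s:
            self._entries.popleft()

    def record(self, message: str) -> None:
        now = self._clock()
        self._entries.append((now, message))
        self._drop_old(now)

    def dump(self) -> list:
        """Return the surviving (timestamp, message) pairs, oldest first."""
        self._drop_old(self._clock())
        return list(self._entries)
```

When the WDP decides to kill or restart a process, it would ask for (or collect) the dump and store it next to the memory dump for later analysis.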

Conclusions

Designing an effective and reliable watchdog for embedded systems is a complex task, and it often means adding code to the other processes. But believe me, it's worth the hassle.

Reliability First - Applications

What does reliability mean in computer science? Speaking about an application, how can we say it is reliable? I don't know if there is a shared opinion, but mine matured after a scary situation.

Some years ago, at my previous workplace, we created a huge file with a very powerful and even more expensive third-party software package. But a few seconds after we pressed the save button, the software crashed. Panic. We searched for the saved file and we found it. Don't panic. So we restarted the powerful-and-expensive-third-party-software to reopen the file, but it failed. We tried several times, even on other PCs, without success. The (binary and proprietary) file seemed to be corrupted. Okay, panic!


Fortunately we also owned a licence for a similar program, much less powerful and much cheaper (about twenty times cheaper). We had nothing to lose, so we tried to open the file with this cheap software and... it worked! All our work was there. So we saved the file under a different name in the cheap software, and finally we were able to open it with the expensive software.

Since that incident I have had a clear idea of what reliability means when speaking about applications. And you?

Image created with GIFYouTube. Scene taken from movie "Airplane II: The Sequel".