Reliability First - Embedded Systems

A well-designed watchdog
This post could end with a single word: watchdog. But designing a good watchdog is a challenging task.

A hardware chip that cuts the power supply to the main processor is indispensable to provide real reliability. This chip must be pinged at regular intervals; otherwise it performs a power cycle. If well calibrated, this system can be effective enough for single-threaded applications running on microcontrollers. But for microprocessors with an operating system and several processes running, a software watchdog is needed too.

The Software Side

Obviously the watchdog process (WDP from now on) must be tied to its hardware counterpart: the WDP is the one that pings the chip. In this way, if the WDP crashes, the system will reboot, ensuring that other processes don't remain without a monitor. This is the easy part; how the WDP checks that everything is working fine is another kettle of fish.
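As a minimal sketch of this link, assume a Linux system whose watchdog driver exposes the usual /dev/watchdog device (the post does not name an OS, so this is my assumption). On such drivers the timer starts when the device is opened and any write to it counts as a ping. The function names hw_watchdog_open and hw_watchdog_feed are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open the watchdog device and return its file descriptor, or -1 on error.
 * Assumption: on the target the path is "/dev/watchdog". */
int hw_watchdog_open(const char *path)
{
    return open(path, O_WRONLY);
}

/* Ping the hardware once: a write of any byte restarts its countdown.
 * Returns 0 on success, -1 on failure. */
int hw_watchdog_feed(int fd)
{
    return write(fd, "\0", 1) == 1 ? 0 : -1;
}
```

The WDP would call hw_watchdog_feed() from its main loop only after all its checks have passed. If the WDP hangs or dies, the pings stop and the chip does the rest.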

One solution may be monitoring the status of every process (ensuring that it's running and is not a zombie) and watching for abnormal usage of CPU and RAM. The hard part here is defining what "abnormal" means.
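The status check can be sketched as follows, again assuming Linux, where /proc/<pid>/stat has the form "pid (comm) state ..." and the state letter Z marks a zombie. The function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Extract the state letter from the contents of /proc/<pid>/stat.
 * The process name may itself contain spaces or parentheses, so the
 * search starts from the LAST ')'. Returns 0 on malformed input. */
char proc_state_from_stat(const char *stat_line)
{
    const char *p = strrchr(stat_line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* Read the state of a live process. Returns 0 if the file cannot be
 * read, for example because the process no longer exists. */
char proc_state(int pid)
{
    char path[64], buf[512];
    FILE *f;
    size_t n;

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return proc_state_from_stat(buf);
}
```

A return value of 0 (process gone) or 'Z' (zombie) is a failure for the WDP; the CPU and RAM thresholds remain a per-process decision.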

Besides this rough check, we can make each process feed the WDP at regular intervals. The drawback is that we need to complicate each process by inserting code not related to its core business. If this doesn't seem like a big disadvantage, try to imagine the amount of code needed if you have a big process with tens of threads running concurrently. Unfortunately, this is the price to pay for a really reliable system.
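On the WDP side, the bookkeeping for these feeds could look like the sketch below. All the names (wd_register, wd_feed, wd_first_stale) are hypothetical, and how the feed travels from a process to the WDP (socket, pipe, shared memory) is left out on purpose. Times are seconds taken from a monotonic clock.

```c
#define WD_MAX_CLIENTS 16

struct wd_client {
    int  in_use;
    long last_feed;  /* time of the most recent feed, in seconds */
    long timeout;    /* longest silence tolerated, in seconds */
};

static struct wd_client wd_clients[WD_MAX_CLIENTS];

/* Register a process (or a thread) and return its id, or -1 if the table is full. */
int wd_register(long now, long timeout)
{
    for (int i = 0; i < WD_MAX_CLIENTS; i++) {
        if (!wd_clients[i].in_use) {
            wd_clients[i].in_use = 1;
            wd_clients[i].last_feed = now;
            wd_clients[i].timeout = timeout;
            return i;
        }
    }
    return -1;
}

/* Record a feed coming from client id; unknown ids are ignored. */
void wd_feed(int id, long now)
{
    if (id >= 0 && id < WD_MAX_CLIENTS && wd_clients[id].in_use)
        wd_clients[id].last_feed = now;
}

/* Return the id of the first client that has been silent for too long, or -1. */
int wd_first_stale(long now)
{
    for (int i = 0; i < WD_MAX_CLIENTS; i++) {
        if (wd_clients[i].in_use &&
            now - wd_clients[i].last_feed > wd_clients[i].timeout)
            return i;
    }
    return -1;
}
```

Giving each thread its own id is what makes the big multi-threaded process expensive: every thread needs its own feeding point.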

Management Of The Failure

OK, now you have your WDP running on your system with other processes that feed it. The next step is to decide what to do in case of failure. If a program is going to consume all the system resources, an obvious thing to do is to kill it. And then?

The answer depends on the process and the system architecture. For some processes, the right solution may be trying to restart them; for others a system reboot may be required. Additional rules may be set on the number of failures in a certain time. Probably, in an average system, all these strategies should be applied to different processes.
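These rules can be expressed as a small per-process policy. The sketch below is only one possible shape: the structure fields and the function wd_on_failure are hypothetical, and the thresholds are up to you.

```c
#include <string.h>

enum wd_action { WD_RESTART_PROCESS, WD_REBOOT_SYSTEM };

struct wd_policy {
    int  max_failures;   /* failures tolerated inside the window (keep below WD_HISTORY) */
    long window;         /* window length, in seconds */
    int  reboot_always;  /* non-zero if the process cannot be restarted alone */
};

#define WD_HISTORY 8

struct wd_history {
    long when[WD_HISTORY];  /* timestamps of the most recent failures */
    int  count;             /* number of valid entries */
};

/* Record a failure at time now and decide what to do about it. */
enum wd_action wd_on_failure(const struct wd_policy *p,
                             struct wd_history *h, long now)
{
    int recent = 0;

    /* Make room by dropping the oldest entry when the history is full. */
    if (h->count == WD_HISTORY) {
        memmove(h->when, h->when + 1, (WD_HISTORY - 1) * sizeof h->when[0]);
        h->count--;
    }
    h->when[h->count++] = now;

    if (p->reboot_always)
        return WD_REBOOT_SYSTEM;

    /* Count the failures that fall inside the time window. */
    for (int i = 0; i < h->count; i++) {
        if (now - h->when[i] <= p->window)
            recent++;
    }
    return recent > p->max_failures ? WD_REBOOT_SYSTEM : WD_RESTART_PROCESS;
}
```

With max_failures = 2 and window = 60, a process that fails three times within a minute triggers a reboot, while isolated failures only cause a restart.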

Reports

After the WDP has done its dirty work, it is likely that the failure will reappear. It can be because of a bug, an unhandled situation or a memory leak that is slowly consuming the RAM. A good way to understand what happened is to have a memory dump of the "bad" process to perform a post-mortem debug. But unfortunately, this is often not enough.

In a complex system where processes interact, a log that shows information from the last minutes before the WDP intervention can be really useful. This means yet more extra code added to the processes.
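One cheap way to keep "the last minutes" at hand is a fixed-size ring buffer that is dumped only when something goes wrong. This is a sketch with hypothetical names and sizes, not a full logging facility.

```c
#include <stdio.h>
#include <string.h>

#define LOG_LINES 64
#define LOG_WIDTH 96

struct ring_log {
    char     line[LOG_LINES][LOG_WIDTH];
    unsigned next;  /* total number of lines written so far */
};

/* Store a message, overwriting the oldest one when the buffer is full.
 * Messages longer than LOG_WIDTH - 1 characters are truncated. */
void ring_log_add(struct ring_log *r, const char *msg)
{
    char *slot = r->line[r->next % LOG_LINES];

    strncpy(slot, msg, LOG_WIDTH - 1);
    slot[LOG_WIDTH - 1] = '\0';
    r->next++;
}

/* Write the surviving lines, oldest first. Returns how many were written. */
unsigned ring_log_dump(const struct ring_log *r, FILE *out)
{
    unsigned count = r->next < LOG_LINES ? r->next : LOG_LINES;
    unsigned first = r->next - count;

    for (unsigned i = 0; i < count; i++)
        fprintf(out, "%s\n", r->line[(first + i) % LOG_LINES]);
    return count;
}
```

The buffer costs a fixed amount of memory and no I/O during normal operation; the dump can be triggered by the process itself or on request from the WDP before it kills the process.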

Conclusions

Designing an effective and reliable watchdog for embedded systems is a complex task and it often implies additional code added to the other processes. But believe me, it's worth the hassle.

When Unit Tests Fail

This week, my colleague Giancarlo B. showed me this short function.
char *unescape(char *in)
{
        char *tmp;
        int i, x;
        char b[5];
 
        tmp = calloc(strlen(in), sizeof(char));
        x = 0;
        for(i = 0; i < strlen(in); i++) {
                if(in[i] == '+')
                        tmp[x++] = 32;
                else if(in[i] == '%') {
                        memset(b, 0, 5);
                        strncpy(b, &in[i + 1], 2);
                        tmp[x++] = (char) strtol(b, NULL, 16);
                        i += 2;
                } else {
                        tmp[x++] = in[i];
                }
        }
        tmp[x] = 0;
        return tmp;
}
Its purpose is to convert a string such as "this+is+a%20space" into "this is a space". The function works pretty well provided that the input string contains at least one "%xx" sequence. If not, the allocated output string is one character too short: there is no room for the terminator. To fix this, it's sufficient to substitute
        tmp = calloc(strlen(in), sizeof(char));
with
        tmp = calloc(strlen(in) + 1, sizeof(char));
The thing that makes this bug special is that it can only be found by looking at the code. Let's see why.

The Speed Of calloc

Image by Martin Maciaszek https://www.flickr.com/photos/fastjack/

The first thing to understand is how calloc works. I learned this a couple of years ago while searching for the differences in speed with malloc + memset. Basically, calloc can often return a pointer to a memory area that belongs to an already blank page, so there is no need to clear it, saving time. At this link there is an extended explanation.

This means that it is (almost) guaranteed that the next byte after the memory returned by calloc is zero. Or, in other words, that the string is NUL-terminated. But unfortunately this is true only until another calloc is called.

This second call is likely to return a pointer to that first unallocated byte, which can later be changed into something different from zero, generating unexpected behaviors.

False Negative

Now you should have understood why unit tests can fail here. Suppose you have this code:
int do_test()
{
        int err = 0;       /* 0 = no errors */
        char *s1 = "this+is+a%20space";
        char *s2 = "thisisnotaspace";
        char *t1 = "this is a space";
        char *t2 = "thisisnotaspace";

        char *r1 = unescape(s1);
        if (strcmp(t1, r1) != 0)
                err = 1;   /* error on the first test */

        char *r2 = unescape(s2);
        if (strcmp(t2, r2) != 0)
                err = 2;   /* error on the second test */

        free(r1);
        free(r2);
        return err;
}
I expect that this function always returns 0, which means no errors. But obviously this result is wrong: the bug is there, the test simply doesn't see it. And, even if I know a couple of ways to modify the test code to catch this particular error, there is no guarantee that we obtain the same behavior after enabling or disabling some compiler flags.
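One such modification, sketched here under my own assumptions, is to run the function against an allocator that puts a non-zero guard byte right after the requested block. The helper names guarded_calloc and overflows are hypothetical, and unescape is a copy of the function above with only the calloc call rerouted.

```c
#include <stdlib.h>
#include <string.h>

#define CANARY 0xA5

static unsigned char *guard;  /* guard byte of the most recent allocation */

/* Like calloc, but with one extra non-zero byte after the block. */
void *guarded_calloc(size_t n, size_t size)
{
    size_t bytes = n * size;
    unsigned char *p = malloc(bytes + 1);

    if (p == NULL)
        return NULL;
    memset(p, 0, bytes);
    p[bytes] = CANARY;
    guard = p + bytes;
    return p;
}

/* Copy of the function under test; only the allocation call differs. */
char *unescape(char *in)
{
    char *tmp;
    int i, x;
    char b[5];

    tmp = guarded_calloc(strlen(in), sizeof(char));
    x = 0;
    for (i = 0; i < strlen(in); i++) {
        if (in[i] == '+')
            tmp[x++] = 32;
        else if (in[i] == '%') {
            memset(b, 0, 5);
            strncpy(b, &in[i + 1], 2);
            tmp[x++] = (char) strtol(b, NULL, 16);
            i += 2;
        } else {
            tmp[x++] = in[i];
        }
    }
    tmp[x] = 0;
    return tmp;
}

/* Returns 1 if unescape(in) wrote past the end of its buffer. */
int overflows(char *in)
{
    char *r = unescape(in);
    int hit = (*guard != CANARY);

    free(r);
    return hit;
}
```

With this allocator the terminator written by the buggy version lands on the guard byte, so overflows("thisisnotaspace") reports the problem while overflows("this+is+a%20space") does not. It still catches only this family of errors, which is exactly the point of the post.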

Conclusions

I really believe that unit tests are useful in order to find a wide range of bugs, but, in some situations, a developer's experienced eyes are indispensable.

Should We Forget IT Security?

In the last months, a high number of security flaws have been reported: starting from bugs in the handling of the SSL/TLS protocol (Apple and GNU), continuing with Heartbleed, Shellshock and BadUSB, up to POODLE just a few days ago.

So now you may ask: what the hell is happening here? Why are there so many threats in such a short time? Are today's developers less skilled than their predecessors?

Well, I simply think that we are looking at it from the wrong side. Many of the above bugs have been there for years and nobody reported them (even if I think that they have been used by someone for illegal purposes).

But lately, there is a great demand for security. The reason is simple: the amount of money that flows through the internet every day. This is why the attention to security has grown so dramatically, and with it the number of bugs being found.

So, what should we expect from the next months? In my opinion, even more security flaws will be disclosed. And this is a really good thing.

Insanity And 4 Other Bad Things

Dilbert by Scott Adams
The definition of insanity is doing the same thing over and over and expecting different results.
Some say this sentence was first pronounced by Benjamin Franklin; others attribute it to Mark Twain or Albert Einstein. They are all wrong. But the caliber of the people to whom this quote is ascribed should tell you something about its correctness.

There is also an ancient Latin maxim (by Seneca) that states a similar concept:
Errare humanum est, perseverare autem diabolicum et tertia non datur.

To err is human; to persist [in committing such errors] is of the devil, and the third possibility is not given.

[Thanks to Wikipedia]
With these premises I have to conclude that the Devil is causing so much insanity in the world nowadays. Take this as a general remark, but it seems to me that many people keep doing the same things in the same old way, facing the same problems and delays every time, without understanding that things can go much better just by changing a few things in their way of acting.

Excluding supernatural interventions, in my experience, this kind of behavior is mainly due to four reasons.

1. (Bad) Laziness

Not the kind that makes you find the fastest way to solve a problem. This laziness is absolutely harmful; it's the concept of comfort zone amplified to the maximum. "I don't wanna change!" and "I don't wanna learn anything new!" are his/her mantras.

Every change in procedures is considered a total waste of time and a new development environment is simply useless. If you have a couple of people of this kind in your team, you can be sure that every innovation will be hampered.

To overcome this behavior you can try to propose a total revolution in order to obtain a small change.

2. Arrogance

"I'm sure I've made the right choice!" no matter if this decision was made years ago and the world has changed since. By the way, the initial choice may have been wrong from the beginning but nothing can make him/her change his/her mind. Probably this has something to do with self-esteem.

It's almost impossible to work together with this kind of developer, since they will never admit their faults and they'll try to put the blame on others.

Sometimes a good strategy may be to suggest things as if they had been proposed by the arrogant person himself/herself.

3. Ignorance

There's nothing bad in not knowing something. The problem is when he/she doesn't care about his/her nescience (see point 1), when he/she doesn't want to admit it (see point 2) or when he/she doesn't trust others' suggestions.

This last point may seem a little strange: if I don't know something, I have to trust someone who is more informed or skilled than me, right? Unfortunately it doesn't work this way. If you need a demonstration, search "chemtrails" on Google.

I don't have a suggestion on how to minimize the impact of these guys in your team. Maybe training can be useful but the risk is that they don't trust the teacher.

4. Indifference

This is the worst, especially when it refers to a manager. He/she doesn't care about the feelings of his/her subordinates. "There is no need for them to be happy doing their job" and "It's not a problem if they spend more time than needed on trivial tasks that can be automated" are his/her thoughts when someone is complaining.

I don't know if there is some sadism in this behavior, but it's quite frustrating. And it's very bad for the team and for the whole Company.

Conclusions

During my life, I've had the "opportunity" to work with people belonging to one or more of the above categories and I can assure you that the last is the worst. You simply cannot team up with someone who doesn't care about you.

Suggested complementary read: Is Better Possible? by Seth Godin.

Horror Code - Why?

while (x >= 0) {
        x--;
        y--;
}
I have only one question: why?

ShellShock: Impact On Average People

In the previous post, I wrote about the ShellShock vulnerability in a general way. Now I want to talk about how this vulnerability can impact the average internet user.

So the question is: what can you do to protect yourself when surfing the web? The same good old things.

Check Your Router

As said in the previous post, there is a remote possibility that your router (if you have one) is vulnerable. To understand if you are at risk, the best thing to do is to take a look at the manufacturer's website. If you are lucky enough, a patch is already available. In any case, you should try before you trust.

Offline tests:

Online tests (not recommended - it's not a good thing to let someone know that your router can be attacked):

Use An Updated Browser

Since the ShellShock vulnerability can be used to inject malicious code into trusted websites, this will probably result in several attempts to take advantage of old and new known browser flaws. If you keep your browser always up to date, you'll be less vulnerable. Avoiding Internet Explorer is a good solution too.

Something should also be said about two products that usually act as plugins for the browser: Java and Flash. There are plenty of exploits based on vulnerabilities in these two products, so it's better to disable them by default and allow their execution only if they are really needed.

Use An Updated OS

I know that you feel comfortable with Windows XP but you should know that Microsoft is not providing security patches anymore. This means that any vulnerability discovered from now on will never be fixed.

[If you feel comfortable with Windows Vista, please contact a doctor <grin />]

Use An Updated Antivirus

Nowadays AVs are smart enough to detect a wide range of malicious web attacks, even unknown ones with their heuristic algorithms.

There are plenty of good free and non-free antivirus products out there: pick one and install it. An average AV is better than no AV.

This suggestion is basically for Windows and Android users but Mac addicts should worry too.

Conclusions

As you can see, all the above suggestions give you more or less the same hint: keep everything up to date. This is because security is a process. This means that nothing can be considered truly attack-proof unless it is turned off with the cable (or the battery) unplugged.