Friday, December 21, 2007

Top Ten, 8 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Easily changed block scope.

if ( ... )
foo();
else
bar();

When adding a line of code to either the if or the else clause, one needs to add braces. If you add to the else clause this way:

if ( ... )
foo();
else
i++;
bar();

the i++ statement is in the else, but bar() gets called no matter what. One needs to add braces.

if ( ... ) {
foo();
} else {
i++;
bar();
}

OK, so the braces aren't needed for the if clause here. But since this is the way that the language works, and since braces can be placed anywhere, and since braces do not slow down the compiled code, or add anything to the code size, why not put them everywhere? It reduces the chance of error, increases consistency, and makes editing easier.

There are C programmers who write one-line if statements this way:

if (...) doit();
if (...) doit2();

but not

if (...) { doit(); }
if (...) { doit2(); }

despite C's terse block structure. I'm not one of them. I don't use one-line if statements.

Thursday, December 20, 2007

Top Ten 7

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Indefinite order of evaluation.

Be sure to also read the supplementary dialog on the subject. I personally come down mostly on the side of the Respondents, and by extension, with Dennis Ritchie's original language design decisions. This may reflect the fact that i came to C after writing a boat load of assembly language for various processors. And yet, there is still wiggle room for yet another opinion. Here it is.

I'd have expected that the order of evaluation in function arguments would be left to right, no if's, and's, or butt's. Here's why. Function arguments are separated by commas. Everywhere else in C, the comma operator is the syntax that guarantees left to right evaluation. So, in the context of function arguments, the comma is effectively overloaded to mean that the order is not specified. That's inconsistent at best.

A related complaint of mine with the C language is the overloading of the while keyword. In my opinion, it shouldn't have been used for both do{...}while(); and while(){...} loops. My objection has to do with reading a program in a free form language that has been badly indented. It isn't a strong objection.

While it would be nice to have a tool, ideally in the compiler, that warns that undefined behavior may take place, it's hard to imagine how such a tool could work, even slowly. Certainly, a simple example like:

foo(i++, i++);

would be easy to catch. But trivial examples like this are also pretty easy to catch by eye. It's the more complicated examples that would make a tool worth having.

And yet, in the last half million lines of C code i've written, and in the millions of lines of C code i've examined, this yawning chasm waiting for someone to fall into continues to wait. Uninitialized variables are much more common yawning chasms. Languages that provide initial values for variables even if they aren't explicitly set have fewer bugs. C is the fusion powered chain saw without safety guards. It will happily let you hack your own legs off. At the same time, it has historically set the gold standard for speed.

Wednesday, December 19, 2007

Top Ten 6

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Unpredictable struct construction.

There are several issues that lead to the exact way that C structs are laid out in memory by any C compiler. Yet, it should be noted that, as a systems language, C was explicitly designed to allow one to set a structure pointer to map to a hardware device so that bits in that device can be manipulated. The first issue is that computer designers have two choices on how to lay out the bytes of a larger integer. These are Big Endian and little endian.

#include <stdio.h>

int
main(int argc, char *argv[])
{
long abcd = 0x01020304;
char *ord = (char *)&abcd;

printf("%0x:%0x:%0x:%0x\n",
ord[0], ord[1], ord[2], ord[3]);
return 0;
}

Hex notation is used to set the long integer abcd. Each pair of hex digits specifies eight bits, or one byte in memory. On x86, ARM, and other machines, this prints "4:3:2:1". The character pointer points to increasing addresses. The first byte printed is 4, which is the value of the least significant byte in the long. This is little endian, meaning that the least significant byte goes into the smallest address. On a SPARC, 68000, and other machines with a 32 bit long, this program prints "1:2:3:4". That means that these machines are big endian. So if one creates a struct with a long integer (perhaps of type int32), writes it on an ARM, and reads it on a SPARC, the SPARC will read the bytes backwards from what the ARM wrote. The SPARC would have to reverse the bytes in each word read in order to be able to get the original meaning.

Some file formats allow either big endian or little endian storage. For example, the TIFF image format allows either. The file format has an indicator for which format was used. The reader program then knows if byte reversing is needed. The reason TIFF allows either is that very often, the same computer is used to read and write the image. If the computer uses the native format, then no reversing needs to be done. This is more efficient.

There are other problems with structures. One is word alignment. Consider this structure:

struct foo {
char first;
long second;
};

Most 32 bit machines read 4 byte long integers faster if they are aligned on an address that is divisible evenly by 4. That's because that's how the memory is physically addressed. If the word isn't aligned, then two words of memory need to be read in order to get the desired word of memory. Some computers take this farther. If an attempt to read a word is not word aligned, the computer generates a fault, and processing is halted. That way, all working programs are faster. On such machines, the C compiler makes sure that structures are always maximally aligned. Padding bytes are added to make sure any long integers are word aligned. So, the above structure is actually represented this way:

struct foo {
char first;
char pad[3];
long second;
};

The pad bytes are not directly usable by the program. They are, in fact, not named. Now, if one doesn't really care how the bytes are ordered, one can always put the word aligned parts first:

struct foo {
int32 second;
char first;
};

This ensures that no padding is needed between the two members, although the compiler may still add padding after the last member so that arrays of the structure stay aligned. Then if the structure is written to a file, and read from a file on another machine, one can be pretty sure that everything will at least start in the right places. One still has to deal with the Big Endian or little endian issue.

In the old days, pretty much ending with the VAX, binary floating point representations were nearly as varied as the number of different machines available. One solution to this problem was to write them out as text. For example, "-123.456e-21" specifies a negative number very close to zero. The reading program would parse that with sscanf(3) or something, and the result would be a native format number. This parsing can be slow, but having the program work is important. However, since about when the 68000 and x86 got floating point, IEEE 754 floating point has been adopted by nearly all chip makers. This format is uniform at the binary level across essentially all platforms today. There is no byte ordering problem. In fact, in 1990, a company that made databases optimized to work on CD-ROMs used IEEE 754 floating point binary as the interoperable format. It supported a large number of computer platforms, and none of these platforms had to do anything special to make it work, unlike for long integers.

Could these issues byte you? Sure. The C language has many, many nods to optimization that turn out to be issues one must think about all the time. If you don't like to think, consider learning idioms to get specific tasks done, and simply reuse them as needed. That makes C a copy and paste language for you. However, if thinking isn't something that turns you on, perhaps programming isn't the right profession for you. Or, consider that C is a poor language choice.

Tuesday, December 18, 2007

Top Ten 5

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Phantom returned values. The example is:

Suppose you write this

int foo (a) {
if (a) {
return 1;
}
} /* sometimes no value is returned */

Generally speaking, C compilers, and C runtimes either can't or don't
tell you there is anything wrong. What actually happens depends on the
particular C compiler and what trash happened to be left lying around
wherever the caller is going to look for the returned value. Depending
on how unlucky you are, the program may even appear to work for a while.

Now, imagine the havoc that can ensue if "foo" was thought to return a pointer!

Rubbish. The bit about random stuff getting returned is true enough. But C compilers can mostly tell that sometimes a value isn't returned from a function that is declared returning a value. While gcc does not report any warning by default, the -Wall option yields the following warning:

return.c: In function 'foo':
return.c:5: warning: control reaches end of non-void function

That is, if a function is declared void, it doesn't have to return a value, and therefore it can drop off the end. In C, one can't return a value by dropping off the end. One must use a return statement. So, for example:

int bar(a) {
return 1;
while (1) {
;
}
}

This does not produce a warning. If you remove the return statement, it still does not produce a warning, because the compiler knows that the while statement is an infinite loop. If you remove the while statement but leave the return, the compiler again knows that the end of the function is not reached. The return statement itself causes a return, and returns a value. If you change the return statement to just return;, the compiler warns this way:

return.c: In function 'bar':
return.c:2: warning: 'return' with no value, in function returning non-void

All this to say that compilation with -Wall can catch errors that might otherwise consume your time. If you fail to use the option, shame on you. If you use the option and ignore the results, shame on you. In the early days, these warnings were available from a utility called lint, which knew the language but did not actually produce an executable. Presumably, the thought was that burdening the compiler with warning reporting was too much for each iteration. But the result was that people didn't check for errors as often.

Yet, there is a common case where you have a function that is declared to return a value, but you often know that the return value isn't used. Worse, you sometimes know that this function won't need its arguments, but it is declared with them anyway.

#include <stdio.h>

int main(int argc, char *argv[])
{
printf("Hello, World.\n");
}

produces these warnings:

hello.c:4: warning: return type defaults to 'int'
hello.c: In function 'main':
hello.c:6: warning: control reaches end of non-void function

Now, main returns a value to the environment. That is, the parent process, often the shell, gets the return value, and may use it for loop control, error reporting or whatever. But sometimes you know that it won't be used. And the same is true for the arguments. The above program does not need to know the command line arguments. If you write main without a return value, the compiler will complain at you.

hello.c:2: warning: return type of 'main' is not 'int'

What to do? Best advice is to return a value anyway.

In the original top ten, the return statement was written this way:

return(1);

While this works, and does what is expected, the parentheses are not needed. The return statement is a statement, not a function, and does not need parentheses around the argument(s). I like a style where functions have the parenthesis right next to their name(s), return statements have none, and loops such as for (...) and while (...) have a space between the keyword and the parenthesis. Visually, the reader can learn to recognize functions as functions without thinking about them too much. This convention is never enforced by compilers, so consistency, the hobgoblin of little minds, must be enforced by self discipline. As a matter of style, i'd like to think that i'm not nearly arrogant enough to think i've got no hobgoblins. And besides, it's not for me to judge if my mind is little or not. It's a job for God, if she cares.

I am, however, qualified to judge C compilers. The bar for C compilers is gcc. If the compiler you supply isn't as good as gcc in some way, you have little excuse. The gcc compiler is available for free in source form. You can download the source, find out how it behaves, examine the source code, and you can update your compiler with that knowledge. If your product isn't as good as one available for free, eventually your very smart customers will use the superior free one. That doesn't mean, for example, that gcc won't optimize some corner case slightly better than yours once in a while. gcc is, after all, very good. The bar is quite high for C compilers.

Monday, December 17, 2007

Top Ten 4

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Mismatched header files. The example is:

Suppose foo.h contains:
struct foo { BOOL a; };

file F1.c contains
#define BOOL char
#include "foo.h"

file F2.c contains
#define BOOL int
#include "foo.h"

Now, F1 and F2 disagree about the fundamental attributes of
structure "foo". If they talk to each other, You Lose!

I've seen this sort of error in released commercial software. How embarrassing. Often, a library is shipped in binary form, with header files that describe the structures and calls used. For no apparent reason, these libraries often have huge numbers of header files. And the above sort of problem can happen.

My own tendency is to have a single header file that describes everything. The single header file can't be self inconsistent, as the compiler would show such errors right away. Besides, it's just easier. It's easier for the library developer, as there is less to cross check. It's easier for the library user, as there's just one header file to include.

And yet, i've written a set of four libraries that are somewhat intertwined. A low level linked list library stands on its own. A library with routines that handle errors needs the linked list library. A library that deals with comma separated files uses both the error library and the linked list library. A library that provides the interfaces needed to write CGI programs, along with access to various databases, uses all three of the others. As each library has non-overlapping focus, the headers are not in any danger of providing conflicting definitions. But there are still four header files. For one thing, you might want to use just the linked list library alone. I've done this.

Friday, December 14, 2007

Top Ten #3

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Unhygienic macros. The example is:

#define assign(a,b) a=(char)b
assign(x,y>>8)
which becomes
x=(char)y>>8 /* probably not what you want */

If the macro were instead written this way:

#define CHAR_ASSIGN(a,b) (a)=(char)(b)
CHAR_ASSIGN(x,y>>8)
it becomes
(x)=(char)(y>>8)

which might be what was desired. One wouldn't call it assign since it has a known side effect, which is the cast to the type char. One generally uses ALL UPPER CASE for macros, since that is the convention that nearly everyone uses in C for macros.

One might wonder why the original isn't what you want. In case it isn't obvious, here it is. The cast binds tighter than the shift, so y is first cut down to a one byte char, and only then shifted right 8 bits. Since eight bits is the whole size of a char, every bit of the value gets shifted out. On the usual machines, where char is signed, negative numbers copy the sign bit (which is 1) as they shift right, and positive numbers shift in zero bits. So at the end, you get 0 or -1.

Of course, since macros are text expansions, one tries to pass the simplest expressions to them. Some macros use their arguments more than once. So if you pass such macros an argument that has side effects, you could end up with the effects taking place more than once. For example,

#define DOUBLEIT(a) ((a)+(a))
DOUBLEIT(it++)
becomes
(it++)+(it++)

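which increments the variable it twice, even though it looks like it happens just once. And, who knows what the return value is? Modifying it twice with no sequence point in between is undefined behavior, so the compiler is free to produce anything.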

Thursday, December 13, 2007

Top ten #2

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Accidental assignment/Accidental Booleans. The example is:

int main(int argc, char *argv[])
{
int a, b, c;

if(a=b) c; /* a always equals b, but c will be executed if b!=0 */
return 0;
}

to which i say that gcc -Wall says this:

two.c: In function 'main':
two.c:5: warning: suggest parentheses around assignment used as truth value
two.c:5: warning: statement with no effect

Two warnings! The first suggests something like:

if((a=b) != 0) c;

This is what is implied by the code, though probably not desired by the coder. The second warning, "statement with no effect", means that the evaluation of the integer 'c' does not produce any machine code. There's no usable side effect. Nothing. But hey, this is just a contrived example, and is beside the point.

As early as Borland's Turbo C 2.0 for DOS (now available for free), we had compilers that would warn about this. And, it has saved people much time and effort. These days, gcc's -Wall option is good practice.

I happen to like the feature. I use this sort of thing all the time:

if ((p = getenv("DOCUMENT_ROOT")) != NULL) {

which says copy the return from getenv() to the pointer p, and then check if it is NULL, which is an error to handle. The parentheses are there just as gcc suggests. Without them you get the return of getenv compared to NULL, and that boolean assigned to p. Probably not what you would want.

The main problem over the years with gcc -Wall is that your distribution's header files change out from under you, and old code that compiled clean periodically needs to be cleaned up again, even though nothing is wrong, and nothing has changed. Worse, at times, the distribution header files make it impossible to get through a compile without noise. Why is that bad? If you are expecting warnings, you start ignoring them. No news is good news.

Wednesday, December 12, 2007

Top ten

The top ten ways to get screwed by the C language has 18 entries. I was a bit surprised that entries weren't numbered starting at zero. Entry 15 (which you might note is more than ten) talks about using an array past its end.

Today, let's start with number one.

#1 Non-terminated comment, "accidentally" terminated by some subsequent comment, with the code in between swallowed.

a=b; /* this is a bug
c=d; /* c=d will never happen */

In the eighties, i wrote a small filter (in C, of course) that can do two things. First, it can show you all the comments in a C program. Second, it can remove all the comments in a C program. This basically solves the debugging of this problem. By looking at all the comments, it becomes obvious, as in the previous example, that some code has been commented out. Since about that time, i've used '#ifdef' to effectively comment out code. In the early days of C, however, before the evil C preprocessor was invented, one would use another idiom:

if (0) {
c=d;
}

The optimizer would notice that the code couldn't be reached and left it out. But then, my comment filter wouldn't show it, right?

Where is this filter now? Well, i've planned releasing it, but it just hasn't happened. I might have published it here, but my home machine is down right now. I might finish repairs tonight.

This thing with comments is, of course, by design. The most memorable time that i was bitten by it was back in the 80x24 terminal days. Someone who worked for me submitted non-working code, but had run out of time to debug it. It turned out that this person used a comment block, where each line of the block had an open and close comment bit. However, the last line of the block's closing comment had a space between the '*' and the '/', so didn't close the comment. Why wasn't it obvious? The terminal didn't wrap the line (which was 81 characters long), it wrote all characters starting with the 80th in the last column. So, the space went there, and a millisecond later (9600 BAUD) the slash went there. At that point, it looked like '*/'. That may have been when i wrote the filter. But i may already have had it. Single stepping in the debugger showed that some code was missing.