Wednesday, December 19, 2007

Top Ten 6

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Unpredictable struct construction.

Several issues determine exactly how a C compiler lays out a struct in memory. Yet it should be noted that C, as a systems language, was explicitly designed so that a structure pointer can be mapped onto a hardware device and the bits in that device manipulated. The first issue is that computer designers have two choices for ordering the bytes of a larger integer: big endian and little endian.

#include <stdio.h>

int
main(int argc, char *argv[])
{
long abcd = 0x01020304;
char *ord = (char *)&abcd;

printf("%0x:%0x:%0x:%0x\n",
ord[0], ord[1], ord[2], ord[3]);
return 0;
}

Hex notation is used to set the long integer abcd. Each two characters of hex specifies eight bits, or one byte in memory. On an x86, Arm, and other machines, this prints "4:3:2:1". The character pointer points to increasing addresses. The first byte printed is 4, which is the value of the least significant byte in the long. This is little endian, meaning that the least significant byte goes into the smallest address. On a Sparc, 68000, and other machines, this program prints "1:2:3:4". That means that these machines are big endian. So if one creates a struct with a long integer (perhaps of type int32), writes it on an Arm, and reads it on a Sparc, the Sparc will read the bytes backwards from what the Arm wrote. The Sparc would have to reverse the bytes in each word read in order to be able to get the original meaning.
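Here is a minimal sketch of that byte reversal. The name swap32 is my own, not a standard function, and it assumes a C99 compiler that provides uint32_t in <stdint.h>.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit word, so that
 * 0x01020304 becomes 0x04030201.
 * swap32 is a hypothetical helper name, not a library call. */
uint32_t swap32(uint32_t w)
{
    return ((w & 0x000000FFu) << 24) |   /* lowest byte moves to the top  */
           ((w & 0x0000FF00u) <<  8) |
           ((w & 0x00FF0000u) >>  8) |
           ((w & 0xFF000000u) >> 24);    /* highest byte moves to the bottom */
}
```

Applying swap32 twice gives back the original word, so the same routine serves both the reader and the writer.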

Some file formats allow either big endian or little endian storage. For example, the TIFF image format allows either. The file format has an indicator for which format was used. The reader program then knows if byte reversing is needed. The reason TIFF allows either is that very often, the same computer is used to read and write the image. If the computer uses the native format, then no reversing needs to be done. This is more efficient.
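As a rough sketch of how a reader might use that indicator: a TIFF file starts with the two bytes "II" for little endian or "MM" for big endian, followed by the 16 bit number 42 stored in that order. The function names below are made up for illustration. Note that this version builds the value byte by byte instead of reversing a native word, which is a different technique from the one described above; it trades a little speed for not needing to know the byte order of the machine it runs on.

```c
#include <stdint.h>

/* Returns 1 if the header says little endian ("II"),
 * 0 if it says big endian ("MM"), and -1 if it is neither.
 * tiff_is_little is a hypothetical name. */
int tiff_is_little(const unsigned char *hdr)
{
    if (hdr[0] == 'I' && hdr[1] == 'I') return 1;
    if (hdr[0] == 'M' && hdr[1] == 'M') return 0;
    return -1;
}

/* Assemble a 16-bit value from two bytes stored in the file's
 * byte order. The result does not depend on the host's order. */
uint16_t read_u16(const unsigned char *p, int little)
{
    if (little)
        return (uint16_t)(p[0] | (p[1] << 8));
    return (uint16_t)((p[0] << 8) | p[1]);
}
```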

There are other problems with structures. One is word alignment. Consider this structure:

struct foo {
char first;
long second;
};

Most 32 bit machines read 4 byte long integers faster if they are aligned on an address that is evenly divisible by 4, because that's how the memory is physically addressed. If the word isn't aligned, two words of memory must be read to get the desired word. Some computers take this further: an attempt to read a word that is not word aligned generates a fault, and processing is halted. That way, all working programs are faster. On such machines, the C compiler makes sure that every structure member is aligned as its type requires, adding padding bytes so that any long integers are word aligned. So, the above structure is actually represented this way:

struct foo {
char first;
char pad[3];
long second;
};

The pad bytes are not directly usable by the program. They are, in fact, not named. Now, if one doesn't really care how the bytes are ordered, one can always put the word aligned parts first:

struct foo {
int32 second;
char first;
};

This ensures that no padding is needed between the members, though the compiler will typically still pad the end of the structure so that arrays of it stay aligned. Then if the structure is written to a file and read back on another machine, one can be pretty sure that everything will at least start in the right places. One still has to deal with the big endian or little endian issue.
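If you'd rather check than trust, the standard offsetof macro and sizeof will tell you what your compiler actually did. This is only a sketch: the struct names char_first and word_first and the function print_layout are invented for the example, and int32_t from <stdint.h> stands in for a 4 byte integer. The numbers printed depend on the compiler and the machine, which is rather the point.

```c
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* int32_t  */
#include <stdio.h>

/* The same two members in both orders (hypothetical names). */
struct char_first { char first; int32_t second; };
struct word_first { int32_t second; char first; };

/* Print where the members landed and how big each struct is. */
void print_layout(void)
{
    printf("char_first: second at offset %zu, size %zu\n",
           offsetof(struct char_first, second),
           sizeof(struct char_first));
    printf("word_first: first at offset %zu, size %zu\n",
           offsetof(struct word_first, first),
           sizeof(struct word_first));
}
```

Any gap between the end of one member and the offset of the next is padding, and a size larger than the last member's offset plus its size means there is padding at the end as well.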

In the old days, pretty much ending with the Vax, binary floating point representations were nearly as varied as the number of different machines available. One solution to this problem was to write them out as text. For example, "-123.456e-21" specifies a very small negative number. The reading program would parse that with sscanf(3) or something similar, and the result would be a native format number. This parsing can be slow, but having the program work is important. However, since about when the 68000 and x86 got floating point, IEEE754 floating point has been adopted by nearly all chip makers. The meaning of the bits is uniform across essentially all platforms today. The standard does not dictate the order in which those bytes are stored in memory, however, so byte order is still worth checking when moving binary floats between machines. As one data point, in 1990, a company that made databases optimized to work on CDROMs used IEEE754 floating point binary as the interoperable format. It supported a large number of computer platforms, and none of these platforms had to do anything special to make it work, unlike for long integers.
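The text approach can be sketched in a few lines. The function names double_to_text and text_to_double are hypothetical, and the "%.17g" format is my assumption for how many digits to write: 17 significant digits are enough to recover an IEEE754 double exactly, provided the C library converts correctly in both directions.

```c
#include <stdio.h>

/* Write a double as text. Returns what snprintf returns:
 * the number of characters the full text needs. */
int double_to_text(double v, char *buf, size_t len)
{
    return snprintf(buf, len, "%.17g", v);
}

/* Parse text back into a native double.
 * Returns 1 on success and 0 if no number could be read. */
int text_to_double(const char *s, double *out)
{
    return sscanf(s, "%lf", out) == 1;
}
```

The text is bigger than the 8 bytes of a binary double and slower to read, but it carries no byte order or format assumptions with it.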

Could these issues byte you? Sure. The C language has many, many nods to optimization that turn out to be issues one must think about all the time. If you don't like to think, consider learning idioms to get specific tasks done, and simply reuse them as needed. That makes C a copy and paste language for you. However, if thinking isn't something that turns you on, perhaps programming isn't the right profession for you. Or, consider that C is a poor language choice.
