Thursday, January 31, 2008

Scheme For Enlightenment Part Eight - Hello, World!

The book Practical Common Lisp is said to be for people who want to do more than compute factorials. Hello, World! is the logical next step.

You've got a Linux machine. If your distribution doesn't have guile installed, perhaps this will help:

sudo apt-get install guile

Under Unix, scripts in scripting languages like sh, csh, perl, awk, and guile can be executed by having the first line start with the name of the program that will run them. I'm not going to get into a discussion about what is a script, a program, an application, or whatever. Here, a scripting language is one where the source can be executed from the command line, so the operating system needs to know what interpreter will execute it. There is no other valuable distinction.

It turns out that there are many ways to get a script run by guile, but this is one of the easiest:

#!/usr/bin/guile -s
!#
(display "Hello, World\n")

The pound bang and path to guile is reasonable enough. The -s command line option tells guile to read the file and run it. The second line is a bit odd. In a departure from standard Scheme, Guile implements a multiple line comment syntax: it starts with pound bang (#!) and ends with bang pound (!#). Pound bang is not Scheme syntax. Scheme already has a comment to end of line - the semicolon - so Guile's authors decided on a multiple line comment concept. And here it earns its keep at startup: the block comment swallows the interpreter line, and the rest of the file is what you want to execute.

This script starts with a call to the function display, which writes its argument to the standard output, which might be your terminal. Guile's display allows an embedded newline escape in the string, just as is done in C and other languages found in Unix environments. You may see this script written this way:

#!/usr/bin/guile -s
!#
(display "Hello, World")
(newline)

It does the same thing, but calls the newline function.

While the command line is important, there are also modules that allow GTK (graphical user interface) and CGI (web interface) applications to be written in Guile.

Wednesday, January 30, 2008

Scheme For Enlightenment Part Seven - A Little Fib

The Fibonacci function is another one of those defined recursively. Fibonacci(0) is 0 and Fibonacci(1) is 1. For integers greater than 1, Fibonacci(x) is Fibonacci(x - 1) + Fibonacci(x - 2). The Scheme implementation is straightforward.

(define (fib x)
  (if (= x 0)
      0
      (if (= x 1)
          1
          (+ (fib (- x 1)) (fib (- x 2))))))

Since there are two calls to itself, followed by an addition, this is not tail recursive friendly. Also, since each call makes two more calls, and each of those calls itself, the total number of calls grows much faster than x. The time taken for fib(x + 1) is about 1.6 times that of fib(x). Large x takes exponentially longer than small x.

Since addition, rather than multiplication, is used, the answers don't become absurdly huge too quickly. The answers can be stored in a 32 bit unsigned integer for x up to 47, and an unsigned 64 bit integer can cope with the answers up to x = 93. With this in mind, a C version is more practical.

unsigned long long
fib(int x)
{
    if (x == 0) {
        return 0;
    }
    if (x == 1) {
        return 1;
    }
    return (fib(x - 1) + fib(x - 2));
}

This returns the same numbers up to x = 93. However, fib(93) would take thousands of years to run. Still, at this point, the Guile version and the C version can be used as a sort of benchmark. It turns out that the C version is about 100 times faster than Guile for the same x on this benchmark. Neither version requires huge amounts of memory. The call stack doesn't get very deep, nor does much go onto the stack.

The tail recursive friendly version of this function is instructive. A helper function is defined which does all the work. It is passed two additional parameters which allow it to accumulate intermediate answers. They are passed initial constant values to get it going. There's only one call to itself, and it really is the last thing the function does, so it's properly tail recursive. It produces the exact same answers as the above functions.

(define (fib-helper a b x)
  (if (= x 0)
      b
      (fib-helper (+ a b) a (- x 1))))

(define (fib x)
  (fib-helper 1 0 x))

One might expect a small constant speed improvement from the tail call optimization. But this function isn't just a little faster, like four times faster. This function is amazingly fast. For example, (fib 93) requires an essentially unmeasurably small amount of time. That's much less than thousands of years. So what's going on?

The key is that the helper only calls itself once. For a given x, it calls itself x times in total. It doesn't take exponential time, it takes linear time. Well, not exactly linear: as x gets larger, the arithmetic takes longer because the numbers are bigger. Even so, (fib 1000) spits out all 209 digits basically without pause. This algorithm is so much better that the original recursive version should just be scrapped. More than one Scheme tutorial demonstrates the slow recursive Fibonacci function. One made a cryptic remark that this algorithm computes Fibonacci numbers slowly. The new version was presented as tail recursive friendly, but the tutorial totally failed to mention that it is a wildly different algorithm. As a point of fact, i have no idea why this function should return the same answers as the original.

This version is so much faster that a C version is pointless. Recall that (fib 93) with this version is essentially instant. The C version might be 100 times faster, but that's still instant, for a total time savings of essentially zero, best case. Actually, the best C version wouldn't do any calculation at all. If one is limited to 64 bit integers, the function can only answer 94 questions (x from 0 to 93). The fastest version would have a small table of answers and just return one of them.

Tuesday, January 29, 2008

Scheme For Enlightenment Part Six - Recursive Language Comparison

C is a procedural language, but it can also do recursion. And its call stack is deeper than Guile's. Isn't that odd? Maybe not. Guile is written in C. At least under Linux, sstk(2) is not implemented, so the stack size can't be changed at run time. So Guile is limited in the same way that C is. C's stack frames are likely more efficiently represented, so more fit in the same sized area, and the stack therefore feels larger. Here's the factorial in C, recursively.

int factorial(int x) {
    if (x == 0) {
        return 1;
    } else {
        return x * factorial(x - 1);
    }
}

It hasn't improved. It's still limited to x = 12 before the 32 bit int overflows. But try this rather spooky equivalent. Is it starting to look like Scheme?

int factorial(int x) { return (x == 0) ? 1 : x * factorial(x - 1); }

If unsigned 64 bit integers can be used, then factorial(20) can be computed. Or C's double (floating point) type can easily be used. That gives you approximate answers (about 15 digits) and works up to factorial(170). It has the side benefit that it knows when the answer overflows, reporting inf.

But let's say we really want exact large integers in C, as we get from Scheme. What do we do? The answer is that we link with an arbitrary precision number library, change the data type to the library's type, and use the library to do the multiplies. The structure would remain, and it would still be fast. After all, Guile's big numbers are written in, uhm, C. And Guile is open source. At worst, one could use Guile's own implementation of big numbers, and the two languages would end in a draw, more or less. It is much more likely that a convenient package exists (GMP, for one) that doesn't drag in garbage collection, and is faster than Guile.

So, is the factorial example a good example? That is, is it fair to the languages involved? Perhaps. While short, it's complicated enough that there is more than one way to write it in each language, and it brings up issues particular to each language. What's not to like?

Back to enlightenment. The question is, does one have to cope with recursion in order to gain any enlightenment from Scheme? Unfortunately the answer seems to be yes. Functional programming and other styles have their basis here. I'm not out to replace my procedural style with something new. I'm out to add flavor to the stew. However, it's clearly not gonna happen without some kicking and screaming.

Monday, January 28, 2008

Scheme For Enlightenment Part Five - Comparative Languages

The dreaded Scheme factorial example is written in C.

I'm coming most recently from the C world. What does it look like in C? C programmers usually think in loops, so something like this comes out.

int factorial(int x) {
    register int product = 1;

    do {
        product *= x;
    } while (--x);
    return product;
}

The variable product is the factorial being built. In each loop, it is multiplied by the current x. The variable x is decremented by one. When the loop counts down to zero, the result is returned.

It's wickedly fast. And it works for positive arguments all the way up to 12. On 32 bit machines, factorial(13) is larger than an integer. Also, we didn't bother to check for zero (or less), so in those cases it loops for a long time before getting the wrong answer.

The fact that it is six or seven lines instead of three or four does not bother me. All of the lines are reasonably easy to read. Besides, C is free format. All the lines can be jammed together, even if it looks strange.

int factorial(int x) { int f = 1; do f *= x; while (--x); return f; }

Sunday, January 27, 2008

Scheme For Enlightenment Part Four - Even More Factorials

The dreaded Scheme factorial example is beaten over the head and shoulders recursively.

Another variant of the loop factorial. It uses a "let loop".

(define (factorial x)
  (let loop ((count x) (product 1))
    (if (= count 0)
        product
        (loop (- count 1) (* product count)))))

The loop starts, creating the variables count and product. The variable count starts at x and counts down. But also, product starts at one. That's the second line. Once each loop, product is multiplied by count, and count is decremented. That's the fifth line. When count reaches zero, product is returned in the fourth line. One odd bit of the syntax is how the loop works. It looks recursive. If so, it's tail recursion and fully optimized. But it really doesn't matter whether it's an iterative loop or recursion; Scheme can decide that. Unless, that is, it can't be nested. The do loop can certainly be nested, and it's reasonable to assume that let loops can be nested also. The full inner loop would execute before the outer loop continued, and each would benefit from tail recursion optimization, at least in theory.

The original version of the above let loop factorial was more confusing. In it, the counting variable was named x - the same name as the parameter. One expects that the inner definition hides the outer definition. So how does Scheme know to use the outer definition to initialize the inner one? Because the initialization expressions are evaluated before the new bindings exist. It's designed that way, but it doesn't make the code any easier to read.

As an aside, in my examples, i've used a shortcut syntax for defining all functions. In the Guile Reference, section 3.1.2.4, it is suggested that this form should not be used, and that the lambda syntax should be used instead.

It could be argued that the alternative define forms are rather confusing, especially for newcomers to the Scheme language, as they hide both the role of lambda and the fact that procedures are values that are stored in variables in the same way as any other kind of value.


Using the lambda syntax, the above example becomes this.

(define factorial
  (lambda (x)
    (let loop ((count x) (product 1))
      (if (= count 0)
          product
          (loop (- count 1) (* product count))))))


While the lambda keyword does indicate that a function is being created, and the define keyword indicates that this function is getting bound to a name, i don't like it much. It has yet another nested set of parentheses, which means that there is that much more i might get wrong. There's an extra line of code (though there needn't be). The indentation of the body of code is deeper, at least in the emacs pretty printer style. And finally, lambda is a Greek letter that doesn't mean anything by itself. Go ahead, look up lambda in Google. It has been assigned all kinds of meanings; Scheme and Lisp are high in the list. I think Dennis Ritchie got it right. The language keywords are not the important part of the language. The important part is the bits that the programmer has created in it. So, in C, structure is marked with one character tokens. Here, we have the opportunity to omit a whole keyword entirely. So much the better. As a recent novice, i consider lambda an advanced topic that one can safely ignore for a while.

Friday, January 25, 2008

Scheme For Enlightenment Part Three - More Factorials

The dreaded Scheme factorial example is continued recursively.

Scheme (and Lisp) purists continue to promote recursion as the right way to do things. Scheme - or at least Guile, Lispme, mzScheme, umb-scheme, and probably all others - supports free tail recursion. That is, if the very last thing your function does is call itself, then no stack penalty is paid. Unfortunately, the first example has a multiply after the recursive call, so it must use the stack. There is a solution, but it isn't as simple.

(define (factorial x)
  (letrec ((tr-factorial
            (lambda (count product)
              (if (> count x)
                  product
                  (tr-factorial (+ count 1)
                                (* count product))))))
    (tr-factorial 1 1)))

This tail-recursion version is much more complex. The second line defines a helper function, which calls itself. This helper function takes two parameters: the first is the current number to multiply by, and the second is the current accumulated result. In the fourth line, it checks if it's done by comparing with x, the original parameter. If it is done, it returns the answer. Otherwise, it multiplies the current number by the accumulated answer, adds one to the current number, and calls itself with these values. The last thing it does is call itself, so the Scheme implementation can do the tail recursion optimization, and no additional stack is used on each call. The implementation can be said to jump to the start of the function, instead of calling it. In the last line, the outer definition calls the inner definition, starting the count at one and the accumulator at one, and the result from this call is returned. And, if one feeds this version 1000, it quickly spits out all 2,568 digits of the answer.

One is expected to use recursion for loops. And this really seems to be how one is expected to do recursion in Scheme. Enlightenment isn't going to come easy. It works, but there are some issues with it. First, there's a weird interaction between the nested function definitions: the inner definition needs access to the outer definition's parameter. One supposes that it could just be passed as a third parameter. Then it could be a standalone function, and one would always call it with two extra parameters, set to one. In fact, Guile supports optional arguments with default values. So this can be written as a single function, at least in principle. We won't go there.

Another issue is that it isn't obvious from the source that the inner function is actually tail-recursive. The self call is smack-dab in the middle of the function definition. This has implications if one is going to use this language. As a programmer or debugger, one must know. Are there tools that could quickly tell you? Well, you can (trace) your call to find out. The tutorials talk about how you can do it with inspection. Plan to get good at it.

In timing tests, the looping structure is quicker than recursion. It isn't much faster in this example, because for large factorials, most of the computer's time is spent doing the large multiplies. For one brought up with procedural languages, the loops are easier to write, to read, and to debug. In fact, having variables to watch in a Scheme debugger provides clues as to what is going on. In a pure functional program there are no variables, so the debugger can only tell you what happens in function calls. This can be less than helpful.

Apparently, at least part of the incentive to use a pure functional style is that such programs are easier to prove correct, at least in a mathematical sense. While noting that setting variables is convenient, this quote from The History of Lisp is an example: (Fans of other programming languages are challenged to write a program to concatenate lists and prove that the operation is associative). Presumably, such a proof would involve having another program use logic to verify such a statement.

But one still must cope with other language issues. For example, how does one cope with the parentheses? A good editor can tell you if your parentheses are balanced; vi and Emacs can do this. Some editors, like Emacs, and other external tools can pretty print functions. The pretty printing can expose some of the structure. And, after a year of beating on the language, it gets a bit easier. But there's still the problem of where the various bits of the do loop should go. The syntax is horrible. And i've gotten comfortable with the full Unix find command, and regular expressions. I grok horrible.

Thursday, January 24, 2008

Scheme For Enlightenment Part Two - More Factorials

The dreaded Scheme factorial example is continued.

Consider this factorial version, which uses a do loop instead of recursion. It has two extra variables, is a little longer, and what is going on anyway? One expects the first example to be better all around. But if one feeds this version 1000, it quickly spits out all 2,568 digits of the answer. Since it works with bigger numbers, and is as quick or quicker for smaller numbers, it's better, right?

(define (factorial x)
  (do ((count 1 (+ count 1))
       (product 1 (* count product)))
      ((> count x) product)))

The second line starts the do loop. The variable count starts at one, and is incremented by one each time through the loop. In the third line, the variable product also starts at one (same loop), and is multiplied by the variable count each time through the loop. The fourth line says that the loop continues until count is more than x (the parameter). When all is finished, the value of product is returned. Let's walk an example.

(factorial 3)

Count starts at 1, and product starts at 1. The parameter x, which is 3, is compared to count. Since count is not larger than x, the body, which doesn't have anything in it, is executed. Then the step expressions run: count becomes 2, and product becomes (* 1 1), which is 1. Count, 2, is compared to x, 3. Count is still not bigger, so the loop goes around again: count becomes 3, and product becomes (* 2 1), which is 2. Count, 3, is compared with x, 3, and it isn't bigger yet, so the loop goes around once more: count becomes 4, and product becomes (* 3 2), which is 6. Now count is bigger than x, so the loop stops and product, 6, is returned. There are two subtle bits here, and i had to read the manual to get them right. From the Guile Reference, section 5.11.4: The test expression is a termination condition. Looping stops when the test is true. It's evaluated before running the body each time, so if it's true the first time then body is not run at all. And second, the step expressions are evaluated using the previous values of the variables, so product is multiplied by the old count each time. That's why count reaches 4 but product is never multiplied by 4, and the answer is 6 rather than 24.

So, in theory, the pure recursive version is simpler and better. In practice the loop version is faster and gives correct results for a wider range of input. What's the difference between theory and practice? In theory, they are the same...

Wednesday, January 23, 2008

Scheme For Enlightenment Part One - Factorials

The dreaded factorial example.

This series is starting to look like a Scheme tutorial. While most Scheme tutorials start with this example, they usually gush over the language as the ultimate for expression, hackability, etc. My opinion is that if you invest enough in a language, your tendency is to advocate that others share your pain. My approach here is to expose the weaknesses when noticed, without apology. Even my current favorite language, C (or maybe English), has plenty of warts. This is my approach to sales. I say things like, Have a free bagel. It only has 500 calories. That's probably only a quarter of the calories you consume in a day. So giving away free bagels can be difficult. On the bright side, my investigations continue with the hope that something good will come of it eventually. Feel free to share my pain.

Some Scheme syntax. (First comes an open parenthesis. Then comes a verb, which is a function or a method. Then comes a space-separated list of all the parameters: the nouns. The list has zero or more of these nouns. The list ends with a close parenthesis.) There is a verb that performs multiplication.

(* 5 6)

The asterisk is for multiplication. The 5 and 6 are things to multiply. The expression returns 30. Since expressions return nouns, they can be used as parameters to other functions.

(+ (* 5 6) 7)

This returns 37. The 30 returned by the multiply is added to the 7. These functions can be nested arbitrarily deep. The Scheme language does not place limits, as far as i know, though implementations certainly do.

This is functional programming. Apparently, the idea that one might assign a value to a variable is not part of functional programming. Scheme allows variables, but some languages, like J, encourage a style without them.

The factorial of an integer is the product of all the integers from 1 to that number. The factorial of 0 is 1. The factorial of 3 is 6 (3 * 2 * 1). Now the mathematician says, we know the factorial of zero is one. We know the factorial of one is one times the factorial of zero, which we already know is one. And so on - the factorial of a number x is x times the factorial of x - 1. Consider this Scheme implementation.

(define (factorial x)
  (if (zero? x) 1
      (* x (factorial (- x 1)))))

It is purely recursive, it matches the definition of the function, it doesn't use any extra variables, and it is short. There are a boatload of parentheses, each one in the correct place. Get used to it. In the second line, the function checks to see if the parameter is zero, and if it is, it returns the number one. This is the factorial that we know - the base condition. In the third line, we are in the else clause of the if, so x is not zero. The function subtracts one from its parameter, calls itself with that, and multiplies the result by its original parameter. This function works for small positive numbers. Let's step through an example.

(factorial 3)

We're going to substitute the actual numbers. In the first call, 3 is not zero, so the return is (* 3 (factorial 2)). There is still a call to do. In that call, 2 is not zero, so the return is (* 2 (factorial 1)); substituting that in above, the real result is (* 3 (* 2 (factorial 1))). The number 1 is still not zero, so (* 1 (factorial 0)) is returned. In this case, 0 is equal to zero, so the number 1 is returned. Substituting that in, we get (* 3 (* 2 (* 1 1))). The innermost multiply is computed, resulting in (* 3 (* 2 1)). The next innermost multiply is computed, resulting in (* 3 2). Then finally, this multiply is computed, resulting in 6. If you type this example into any Scheme system, the result, 6, is printed.

While it works for small positive numbers, if you feed it a thousand, Guile reports this:

ERROR: Stack overflow

Now, the factorial of 1000 is a very large number indeed. But the stack overflow has little to do with the size of the answer. Scheme supports arbitrarily large integers. Millions of digits, for example, are not only supported but practical: computations with million digit numbers can be performed in a reasonable amount of time on modern machines.

The reason it gets a stack overflow is that it needs a stack level for each of the thousand calls to itself. The system runs out of stack space. The Earth rests on the back of a turtle. What does that turtle stand on? Another turtle! It's turtles, all the way down.

Tuesday, January 22, 2008

Scheme For Enlightenment Part Zero

As a part of my ongoing road to enlightenment, i've just finished reading the Guile Reference Manual. Guile is an implementation of the computer language Scheme. Scheme is a dialect of the Lisp language. A million years ago, when i was an undergrad, i managed to get credit for just one computer course, entitled Lisp/Algol. Both are relatively ancient computer languages. Lisp dates from something like 1958, and we used Algol 60, the 1960 vintage standard. The seven week course spent some five weeks on Lisp, and the remaining two on Algol. Yet, two weeks of Algol seemed like luxury compared with only five weeks of Lisp.

The idea that a language as sophisticated as Lisp was invented so early came as a complete surprise. In retrospect, early computer users were taken from the pool of mathematicians. And Lisp is a language written by mathematicians for mathematicians, much as C is a language written by systems programmers for systems programmers.

Despite eight semesters of college math, i can't really say that i think like a mathematician. You see, my math was heavy in differential equations for engineering. I actually liked Boundary Value Problems. That's because there are obvious real world applications for this stuff. As a rocket launches, even if the thrust is constant, the mass decreases as fuel is consumed, the air drag decreases with altitude, but increases with speed. So where does the rocket end up?

I know how to go from home to work. I used Yahoo Maps to find out how to get to Phil's house from work. Even if i'm home, i can get to Phil's by going to work, then using my new directions to get to his house. As an engineer, that's OK. Phil happens to live pretty close to where i work, at least compared with the total distance to where i work. Dave happens to live pretty near where i live. So, if one day, i wanted to go directly to Dave's house from work, i'd look that up on Yahoo Maps, and go. But if i then wanted to go to his house from home, i'd look that up again on Yahoo Maps, because going 80 miles out of my way to get to Dave's, only 4 or 5 miles from my house is absurd. But the mathematician considers the problem solved, and optimization is irrelevant. Worse, the mathematician assigns cryptic symbols to the way points. My house is mnemonically labeled 'A', and Phil's 'B', and Dave's 'C'. So, she says, "I know how to get from 'A' to 'B', and from 'B' to 'C', so I know how to get from 'A' to 'C'". Global warming notwithstanding.

And Lisp or Scheme is, so far, like that. There are lots of examples that are very short that are also very slow. It's also not at all clear how they could be made quicker. More on this later.

For a year i plowed through a Scheme tutorial (Teach Yourself Scheme in Fixnum Days), beat on Lispme (a Scheme implementation for the Palm Pilot), and now have Guile loaded on my Nokia hand held computer (i don't care for the term 'tablet', as it seems to limit, rather than enable wide application). Guile is also loaded on my Linux desktop where there's a real editor and keyboard. A real editor is a must.

Scheme, as a language, has some unusual features. Integers can be arbitrarily large. They can be millions of digits long. That suggests certain application classes, like discovering large prime numbers. Guile's implementation of larger numbers also seems to be fast. It also has support for non-local branching, handy for error recovery. And there's this kind of odd speculative computing which seems to be able to search arbitrary decision trees. There's also an odd programming style that seems encouraged - functional programming. There might be something of value here. Who knows?

Guile is an implementation of Scheme. But the Guile developers have added capabilities to encourage certain kinds of work. It is possible to embed a Guile interpreter in an otherwise C application, perhaps as an extension language, or configuration module. Additionally, it is reasonably easy to extend Guile by using C modules, which can be plugged into Guile.

Though Guile has been used this way, i personally don't see it. Lisp has never had the ease of reading that an Algol-like language has. Sure, it has automatic memory management, but that comes at a price. Guile is between 100 and 500 times slower than C. As near as i can tell, all the Scheme implementations are. However, there is a Scheme compiler that i have not examined.

The Guile Reference Manual talks about how one should do things certain ways for speed. In some ways, this is not a shock. A factor of 100 is roughly 10 years of computer speed improvements. So, you fire up your hot new 4 GIPS computer, and get Pentium II performance. And yet, everywhere in the manual, recursion is held up as the iteration device. Guile's own do loops run faster, and never suffer stack crashes because you didn't get the tail recursion right.

My guess: mathematician's proofs use indirection. Recursion is indirection, only more so, but with little extra effort, or at least syntax. This blinds them to the fact that computers naturally deal with iteration more efficiently.

Lisp has often been the preferred tool for the development of artificial intelligence. Might the field move forward 10 years with a conversion to C?

Monday, January 14, 2008

Top Ten, 18 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Fabulously awful "standard libraries"

It turns out that libraries are harder to write than C compilers. Now, designing a good language may be difficult, but implementing an already standard language is a matter of meeting the requirements. Libraries are generally not written to standards. One writes a library by one of two processes. Either one has a task to do, and builds a library to support that task while building an application. Or, one tries to guess what functions programmers might need, and provides functions that do those things. Both approaches have problems. Your application might not need something other programmers will need. Your application might need the same things, but with peculiar semantics, making the implementation difficult to use. Or, you might not have guessed what other programmers might need.

C's standard library isn't worse than most. There were a few mistakes, such as scanf(3) and index(3). There were some omissions in the original version, like strstr(3). While the bugs that were designed in are still there, for backward compatibility, solutions to those problems are available. One is told to use the fixed versions. Don't use scanf(3) - use fgets(3), then sscanf(3). Don't use index(3) - use the new name, strchr(3). I count the standard C library as a success because i actually use it, and don't invent a new string library.

At least scanf(3) has a good excuse. The idea was to allow string parsing to work on streams. It's just that scanf(3) isn't a good way to make that happen. That said, i have written a stream based parser that needed its own version of atof(3). That's about it.

Friday, January 11, 2008

Top Ten, 17 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Signed Characters/Unsigned bytes.

There is no byte datatype in C.

Yes, integers are signed unless you say unsigned. Yes, integer arithmetic that overflows or underflows is silent. That's largely because that's how integer arithmetic works on most machines. On an x86, char is signed. However, at least on an Arm processor, char is unsigned. So it's really worse than being inconsistent on one machine: behavior can change when you port to another machine. In this case, the x86 and the Arm processors are both 32 bit machines. They have the same byte order within words, which is to say that they are both little endian. The same compiler, gcc, was used in both cases.

Consider a program that reads a file a character at a time:

#include <stdio.h>

int main(int argc, char *argv[]) {
    char c;
    while ((c = getchar()) != EOF) {
        putchar(c);
    }
    return 0;
}

On an Arm, gcc reports that the variable c will never equal EOF. And, indeed, if you attempt to run the program, it reads the file, then writes a bogus character (0377) forever. It does not see the End of File signal. On an x86, this program behaves like cat(1), unless the file has a 0377 character somewhere in the stream. If it does, it stops processing at that point, even if there is more data to read. If the declaration of c is changed from char to int, the problem vanishes. Now, EOF is documented to be int. It's supposed to be a value that is never a character that could be in a file. However, conceptually, the program is reading a file one character at a time. So, it's easy to forget.

By contrast, floating point overflows can generate exceptions. That's because that's how floating point hardware works.

In the Lisp language, integers never overflow. That's because if you add one to a small integer, it simply becomes a larger integer. When it becomes too big for a 32 bit integer, it magically becomes something larger than 32 bits. You always have the right answer. And yet, it turns out that this feature isn't something that i use much in Lisp. I'd be happier if integers were all small and fast. Especially fast. The good news is that you can now get the performance of an original Pentium using the Lisp language, if you have a modern computer to work with.

Thursday, January 10, 2008

Top Ten, 16 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Octal numbers.

In C, the list 007, 008, 009, 010 is interpreted in octal, not decimal, with 010 equal to 8 decimal. Fortunately, the compiler emits an error for 008 and 009, as they have non-octal digits. This is not a warning. No object code is produced.

If you are unlucky enough not to have included an 8 or a 9 in your list of numbers, there is no warning. You just don't get the numbers you might have expected. If you want to align numbers in your code vertically, use spaces.

I confess that these days, i write most of my binary constants in hex rather than octal. These constants start with a 0x prefix, and are hard to miss. Also, octal digits encode three bits. Three is not a divisor of common word lengths used in computers today. That is, we have 8 bit bytes, 16, 32, 64 and 128 bit words. But in the old days, we used to have 12, 18, 36 and 60 bit words. Octal made more sense then, especially for 18 bit machines. The 8080 microprocessor has instructions which are easier to decode in octal than hex, because the "mov" family was 01ab, where a was the destination register and b was the source register. These machines are less common now.

But if i were God, and could redesign human civilization from scratch, it would be octal, not decimal. Mental arithmetic would be easier. Circles would be divided into 01000 (which is 512) degrees, and further subdivided with a decimal-like notation. The second would exactly be 1/0100000th of the day (1/32768th), which would make it about 2.6 times longer, and the phrase "just a second" would make a little bit more sense. By the time computers and C were designed, there'd be no reason to support decimal.

This would not be a good reason to make C support octal, as it still wouldn't match word lengths. A good reason for C to support octal is that it is a systems language.

One solution, that C didn't take, is direct support for binary. There are some macros for supporting binary, and they're ugly, but they work. Real men use binary. And, there are 10 kinds of people in the world. Those that understand binary, and those that don't.


/* Binary constant generator macro
 * By Tom Torfs - donated to the public domain
 */
#define HEX__(n) 0x##n##LU

#define B8__(x) ((x&0x0000000FLU) ? 1 : 0) \
              + ((x&0x000000F0LU) ? 2 : 0) \
              + ((x&0x00000F00LU) ? 4 : 0) \
              + ((x&0x0000F000LU) ? 8 : 0) \
              + ((x&0x000F0000LU) ? 16 : 0) \
              + ((x&0x00F00000LU) ? 32 : 0) \
              + ((x&0x0F000000LU) ? 64 : 0) \
              + ((x&0xF0000000LU) ? 128 : 0)

/* for up to 8-bit binary constants */
#define B8(d) ((unsigned char)B8__(HEX__(d)))

/* for up to 16-bit binary constants, MSB first */
#define B16(dmsb,dlsb) (((unsigned short)B8(dmsb)<<8) + B8(dlsb))

/* for up to 32-bit binary constants, MSB first */
#define B32(dmsb,db2,db3,dlsb) (((unsigned long)B8(dmsb)<<24) \
                              + ((unsigned long)B8(db2)<<16) \
                              + ((unsigned long)B8(db3)<<8) \
                              + B8(dlsb))

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i;
    unsigned int a[] = {
        B8(01010101),
        B16(10101010,01010101),
        B32(10000000,11111111,10101010,01010101)
    };

    printf("%d\n", atoi("016"));   /* atoi() always parses decimal: prints 16 */
    for (i = 0; i < 3; i++) {
        printf("entry = %u\n", a[i]);
    }
    return 0;
}

Wednesday, January 09, 2008

Top Ten, 15 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Utterly unsafe arrays.

There are uncountably many ways to walk off of the end of an array in C. And, there are only two ends of an array. C happily allows you to walk off either end with undefined results.

The reason this is true is that C is a systems language, written by systems people for building systems. By systems, it is understood that it is meant operating systems - the bits that control everything on the machine. While operating systems have been written in other languages, assembly, Lisp, Pascal, and so on, few have been as successful for this task as C. C was designed as a replacement for assembly language. It adds portability and some expression. But, as much as possible, it does not subtract anything. That seems to have been the goal, anyway. In practice, it doesn't subtract much. As a systems programmer, i have to say that i like C. You'll have to pry my unsafe arrays from my cold dead fingers.

What possible use do they have? I have actually used (and it worked), code like this:

a = b[-4];

Tuesday, January 08, 2008

Top Ten, 14 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Under constrained fundamental types.

I learned C working on 16 bit PDP-11's in a world of mostly 32 bit Vaxen. These days, i work mostly with 32 bit machines even though 64 bit machines are commodity items. My code tends to be portable and efficient where porting to a new architecture is largely simple recompilation. In my opinion, there are no 8 bit machines. Pointers on 8080's, z80's, 6800's, 6809's, 6502's and 1800's were all 16 bits. These are 16 bit machines.

In C, an int must be at least 16 bits, but compilers can make them larger. It's reasonably easy to write code on a 16 bit machine and port it to a larger 32 bit architecture. It takes more effort to do the reverse and get it right. I've personally seen compilers that provide 16, 32, 36 and 64 bit integers when asked for an int. This was clearly a language design feature. It allows the same code to work reasonably efficiently on machines of different sizes.

I've been asked if, every time i use an int, do i really think of it as being only "at least 16 bits"? The answer is, yes, i really think about it, but no, sometimes my expectation is that this int will be able to index my largest arrays on the local machine. Since strings are arrays of single bytes, it is expected that sizeof(int) == sizeof(char *). The language does not guarantee this, however. It should.

When the Digital Equipment Corp Alpha processor team announced that their C compiler had a 16 bit short, a 32 bit int, and a 64 bit long, i was disappointed. A 32 bit int only gives you four giga entries into an array, yet the machine is capable of having arrays quite a bit larger. I had also expected that int was a redundant type that could be either short or long. I doubt that i depend on this.

The discipline of thinking about something every time is a real asset to programmers. It is the only known solution to the fence post problem, for example.

Now, gcc's long long int concept is great for 32 bit machines providing 64 bit integers. Here, int is 32 bits which matches pointer sizes, and is efficient on such a machine. The long long int type is strictly for larger integers. And, i've used them for computing hashes. Very efficient. It should be noted, however, that the keyword that originally meant an integer twice as big as a pointer was originally long. The long type on the PDP-11 was 32 bits, where an int and char * were 16 bits.

Without a doubt, C needs a new integer size keyword that allows for 64 bit integers. So if i were God and could go back in time and fix this in C's design, it might look like this:








char *   char   short   word   sword   pen   long   int
  16       8      16      32     64    128     32    16
  32       8      16      32     64    128     64    32
  64       8      16      32     64    128    128    64

In the mean time, perhaps one can #include <sys/types.h>, or the C99 <stdint.h>, and get typedefs for constrained integer sizes up to 64 bits. But long is largely broken.

The migration from 16 to 32 bit machines, and now from 32 to 64 bit machines represents growing pains for C. It's a hard problem, and remains painful.

Monday, January 07, 2008

Top Ten, 13 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Cluttered compile time environment.

The actual complaint is that the standard header files declare symbols that a programmer might accidentally use. The example was a programmer declaring BUFFSIZE, but then accidentally using BUFSIZ, which is declared in stdio.h. The programmer might accidentally use BUFSIZ if the programmer is used to typing it. I'd argue that the programmer has used a BUFSIZ like symbol because stdio introduced the idea.

I've seen worse. A long time ago, i saw a library that declared the external variable line. Naturally, it was a library where the caller might use such a common word as a variable. And, it happened.

There are conventions that C library writers use to prevent this sort of thing. For example, using a common prefix for all library functions and global variables helps. While it helps, it can't make a guarantee that this problem won't come up. C is not object oriented. Beyond giving a symbol internal linkage with static at file scope, there is no way for C to encapsulate symbols into a module. The potential for incursion is real. Library writers must do the best they can.

I wouldn't go so far as to state that adding object oriented scoping rules would fix C. I am not of the opinion that Java's effectively infinite sized object names is a good solution to this problem. As a matter of fact, i like C's relative terseness.

Fortran allows one to simply use a variable, and the compiler will allocate space for it. This leads to the problem that a typographical error introduces a new variable when you intended to assign to an existing variable. These problems are, indeed, hard to find. I can recall scanning symbol tables for such errors.

C requires that variables be declared before they are used. Therefore, this problem that Fortran suffered from is gone. And, indeed, the C preprocessor requires that symbols be defined before use as well. However, no warning is given. The symbol you wanted replaced is just not replaced. Generally, this leads to a compilation or link error.

Sunday, January 06, 2008

Top Ten, 12 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Uninitialized local variables.

Once again, gcc's -Wall option warns with: 'B' may be used uninitialized in this function, even for the more complicated version. The only problem case is where a variable is always set, but possibly in different places. The compiler doesn't know that at least one path will be taken, so it thinks the variable might not be initialized, and warns you. In these cases, i usually break down and provide a redundant initialization. At least it's cheap.

Initialization of a variable is probably cheap, especially if it's infrequent. So, the optimization of leaving it off, because you know it won't be used before set probably doesn't save much memory or time. Now that compilers can warn you of this condition, they could also optimize out an explicit initialization, and do it safely. Of course, if you initialize a variable with the return value of a function, the compiler would also have to compute any side effects before it could optimize it away.

Friday, January 04, 2008

Top Ten, 11 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Undefined order of side effects.

Once again, gcc's -Wall option warns with: warning: operation on 'm' may be undefined. I've never seen this error in practice. This is probably because it isn't clear to me what the result should be. As a flaming conservative, i would cautiously break such a statement into statements that could have no possibility of doing anything other than what i want to happen. Or is that the flaming control freak that i harbor way down deep? I even add extra parentheses if there is the slightest doubt that an expression might be evaluated out of order, even when the language makes hard promises. It's easier than looking it up.

Thursday, January 03, 2008

Top Ten, 10 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Unsafe returned values.

char *f() {
    char result[80];
    sprintf(result, "anything will do");
    return(result);   /* Oops! result is allocated on the stack. */
}

The issue is that the variable result is an automatic variable. It is allocated on the stack. It is not initialized, and has whatever garbage happened to be there. That's fine. The function writes to it before it is returned. However, when the function returns, the memory on the stack is free to be used by other routines. It may get overwritten before it can be used. For example, if the caller then calls printf() to print the result, that may use enough of the stack memory to corrupt the value, with unpredictable results. If you're lucky, you get a segmentation violation. If not, you get the wrong results from a program that appears to work.

These days, gcc reports warning: function returns address of local variable even without -Wall for this code. It has been a problem in the past. Indeed, a code review of production code showed a library had this issue with essentially every function in the library. The solution was to make the buffer static in every case.

char *f() {
    static char result[80];
    ...

In this case, the result buffer is allocated in static storage rather than on the stack, and will remain untouched until that routine is called again. That routine may overwrite the value. But since the static keyword has scope implications as well as storage class implications, one knows that only that routine can do anything to it. That is, the caller knows that if it avoids calling that routine again, the value won't change. There are call graph utilities that can help the programmer be sure. Of course, the caller has a pointer to the buffer and can do whatever they want with it. But surely, if the programmer does that, they'll know they're changing it.

In Andrew Koenig's book, C Traps and Pitfalls, there is a subtle variant of this issue. On the PDP-11 (a sixteen bit machine), memory was often tight. To get a few extra bytes of heap, programmers would often declare a buffer on the stack in main(), and call setbuf(3) to tell the standard library to use it instead of one on the heap for one of the standard streams, like stdout. Since main() has scope and life for the life of the program, it sounds as if this should always work. It never failed. However, the standard library also has onexit(3), where you can register a function to be called whenever exit(3) is called. The trouble is, these functions are called with the stack unwound to the caller of main() - traditionally called crt0. This was a small chunk of machine code that set up the stack and called main(). The upshot is that functions might be called after main() has exited and its stack variables are free, and possibly corrupt. If you have enjoyed the top ten, you'll likely enjoy Andrew's book. The second reference has a table of contents.

Wednesday, January 02, 2008

Top Ten, 9 of 10

When we last left our hero, we were looking at ten ways to get screwed by the "C" programming language. Today's entry is Permissive compilation.

Three examples are presented. The first has to do with the comma operator. The gcc's -Wall option at least warns about it. The second has to do with initialization of variables. The third is an incompletely specified array. I have to say that i agree that all these problems are serious language or implementation issues.


I once modified some code that called a function via a macro:

CALLIT(functionName,(arg1,arg2,arg3));

CALLIT did more than just call the function. I didn't want to do the extra stuff so I removed the macro invocation, yielding:

functionName,(arg1,arg2,arg3);

Oops. This does not call the function. It's a comma expression that:
  • Evaluates and then discards the address of functionName
  • Evaluates the parenthesized comma expression (arg1,arg2,arg3)


There are two issues here. First, C compilers are not required to complain that no code is produced. The gcc compiler's -Wall option says warning: left-hand operand of comma expression has no effect. If there is no comma operator, then warning: statement with no effect is emitted. The second problem is the comma operator. The lesson is that you should be using -Wall.

We've already seen that commas inside a function call are not order-keeping comma operators. It turns out that people really do use semicolon statement terminators, but seldom use comma operators. They are handy in the creation of macros. The comma operator can keep multiple expressions executing together, without introducing additional block structure. This turns out to be important, because there are places where an expression can go that a new block can't. With the advent of compilers that can inline functions, there is at least less need for this kind of macro.

#include <stdio.h>

int main(void) {
    int a = 0;
    int var = 2;
    switch (a) {
        int var = 1; /* This initialization typically does not happen. */
    default:
        printf("var=%d\n", var);
    }
    return 0;
}

This is stunning. The -Wall option complains that the first var is unused. This is true enough. It complains that the second var is used uninitialized. And sure enough, it prints stack garbage. I recall vaguely that the original C standard did not allow initialization of variables after the start of a function. That is, you could declare them, but not also give them a value. What could the compiler do? The inner var could be allocated at the same time that the outer variable is declared, at the start of the function. It could be set just before the switch is executed. Given that it allows the syntax without comment, that is what it should do. However, if it isn't going to do anything, it shouldn't accept the syntax. It's a bug, perhaps in modern gcc (4.2.1). No possible good can come from this.

Why is this a problem? C has always worked this way, right? However, these days there are many C-like languages - C++, Java, and so on - and many do allow this syntax. Many of us program in multiple languages, so it's much more likely that this is an issue. It is NOT a bug in these other languages.

#define DEVICE_COUNT 4
uint8 *szDevNames[DEVICE_COUNT] = {
    "SelectSet 5000",
    "SelectSet 7000"};   /* table has two entries of junk */

While it is unfortunate that two entries in the allocated table are junk, they are no more junk than if no entries were initialized. I've personally used this language feature to initialize part of an array that is later filled in completely. Now, when i've done it, the array is allocated as static or global memory. The unused entries don't have junk, but rather have zeroes. And, of course, my program had logic that prevented misuse. So, while C lets you shoot yourself in the foot, it also lets you do things that are difficult in other languages for some benefit - perhaps some performance reason.