Sunday, August 01, 2010

Practical hyperthreading

I recently read some inconsistent material concerning Intel's CPUs. It had to do with hyperthreading.

The idea behind hyperthreading is that you have more than one set of CPU registers (including hidden registers) so that it is very quick for the CPU to switch from one process to another. In fact, it can be done between every instruction. That is, if there are two processes currently runnable, the CPU can execute instructions from alternating processes.

There are a couple of reasons one might want to do this. One might want to have separate state for operating system kernel instructions and user-level instructions. One might have separate state so that interrupt routines would run quickly. No need to save the state, just use registers dedicated to running interrupt service routines. This was done for Digital's PDP-10 computer back in the 1970s.

But there is a problem for modern machines that's a little different. It's the memory wall. Eventually, the bottleneck for von Neumann architecture CPUs is the communication of data between the CPU and main memory. One can delay this bottleneck for a while, and this has been done, but it will eventually come up and smack you in the face. And these days, CPUs are much faster than main memory. So, while the CPU may execute instructions at three billion per second, main memory takes at least several nanoseconds to respond to a request. OK, so most memory references only go as far as the on-chip CPU cache. These requests may be satisfied in as little as a single cycle. But to go all the way out to main memory can take what seems like forever. At least, forever if you're a fast CPU. It can be over a hundred cycles.

So the idea is, have more than one process running. When an instruction is executed that fetches data from main memory, the CPU might have to wait for the result before the next instruction is executed. However, if the CPU switches to an entirely different process, then that process's next instruction can't be waiting for this result. There's a better chance that it can proceed without waiting at all. If the CPU is idle less often, then it is doing more useful work per unit time. It's faster. For Intel, this is usually about 20% faster. That is, you get about 20% more useful cycles per unit time.

However. Let's say you have two processes. Each process will get about half of the available cycles. If the total is 120%, then each process will run at about 60% of the original speed. Yes, that's right, the total throughput is higher, but a single processor will run a single process faster. But consider that a single processor will run two processes at 50% each, rather than 60% each. Still, people worried about speed often want their single process to run as fast as possible. Can one get the best of both worlds?
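The arithmetic above can be written out as a small sketch. The 120% figure is the rough number quoted in this post, not a guarantee, and the names `HT_TOTAL_THROUGHPUT` and `per_process_speed` are made up for illustration.

```python
# Rough figure from the text: with hyperthreading, one core delivers about
# 120% of its non-hyperthreaded throughput (an approximation, not a guarantee).
HT_TOTAL_THROUGHPUT = 1.20


def per_process_speed(n_processes, hyperthreading):
    """Speed of each process relative to one process alone on the CPU."""
    if n_processes <= 1:
        return 1.0
    total = HT_TOTAL_THROUGHPUT if hyperthreading else 1.0
    return total / n_processes


for ht in (False, True):
    for n in (1, 2):
        speed = per_process_speed(n, ht)
        print(f"hyperthreading={ht!s:5} processes={n}: "
              f"{speed:.0%} each, wall clock x{1 / speed:.2f}")
```

With these assumptions, two processes each run at 50% without hyperthreading and at 60% with it, while a lone process runs at 100% either way.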

Yes. Often, there is inherent parallelism available within an application. The operating system supports something called threads. An application can have two threads running at the same time. Both threads have access to all the memory of the application. And in a hyperthreading environment, both threads can contribute to the performance of the single application. Therefore, a single application can get the speed boost offered by hyperthreading. It requires more effort on the part of the programmer. The result is usually more complicated, and can be more difficult to debug (get right). But it can be, and often is, done.

When hyperthreading became available, i fired up a benchmark, timed a run of one copy. Then timed a run of two copies at the same time. Then, i went into the BIOS, turned on hyperthreading, and reran both tests. With hyperthreading turned off, the results were 100% speed with one process, and 50% for each with two simultaneous processes. With hyperthreading turned on, the results were 100% speed with one process and 60% for each with two processes. There was no additional gain to be had in total throughput for more than two processes with my simple benchmark. The benchmarks perform a fixed amount of work. So by 50%, i mean that this work load takes twice as long (wall clock) to execute. By 60% speed, i mean that this work load takes 1.67 times as long measured by the wall clock (1 / 1.67 = 0.60). Very simple.
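A test along these lines might look like the sketch below. It is not the benchmark used for the numbers above; the `workload` function is a hypothetical stand-in, and the helper names are invented. On a multi-core machine the copies would also have to be confined to a single physical core (for example with `taskset` on Linux), or the two copies will simply run on separate cores.

```python
import time
from multiprocessing import Process


def workload(n=2_000_000):
    """A fixed amount of CPU-bound work (hypothetical stand-in benchmark)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_copies(copies, n=2_000_000):
    """Wall-clock seconds for `copies` simultaneous runs of the workload."""
    procs = [Process(target=workload, args=(n,)) for _ in range(copies)]
    start = time.perf_counter()
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return time.perf_counter() - start


def relative_speed(t_one, t_many):
    """Per-copy speed relative to a single copy; 1.0 means no slowdown."""
    return t_one / t_many


if __name__ == "__main__":
    t1 = time_copies(1)
    t2 = time_copies(2)
    print(f"one copy: {t1:.2f}s, two copies: {t2:.2f}s, "
          f"each copy at {relative_speed(t1, t2):.0%} speed")
```

Run it once with hyperthreading off and once with it on, and compare the per-copy speed for the two-copy case.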

But i started this article talking about confusion i've seen. One of the things i've heard stated is that if you turn on hyperthreading, your speed is immediately cut in half. This may be due to the way that the tools report your performance. We pretty much have an idea what 100% means if there is a single CPU with no hyperthreading. 100% use means that the CPU is totally consumed. But with hyperthreading turned on, some tools report 100% if two threads are executing the entire time. And if only one process is running, these tools often report 50%. However, in this latter case, the CPU isn't idle. It's getting 83% (100 / 120) as much work done as is possible with this CPU. But this is exactly as much total work as the CPU would have done if hyperthreading were turned off.

And it gets worse. Some tools report 200% instead of 100%, as above. That's on the same running system. With some tools reporting 100% and others reporting 200%, it's a royal pain to compare results. And those reporting up to 100% often end up reporting 102% from time to time.
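To compare readings from two such tools, one can put them on the same scale first. This is only a sketch with invented helper names, and it assumes you know how many logical CPUs the tool is counting (2 for a single hyperthreaded core).

```python
def to_whole_machine(percent, logical_cpus):
    """Convert a per-CPU-scale reading (0..100*logical_cpus) to a 0..100 scale."""
    return percent / logical_cpus


def to_per_cpu(percent, logical_cpus):
    """Convert a whole-machine reading (0..100) to the per-CPU scale."""
    return percent * logical_cpus


# One hyperthreaded core shows up as 2 logical CPUs.
print(to_whole_machine(200.0, 2))  # both threads busy -> 100.0
print(to_whole_machine(100.0, 2))  # one thread busy   -> 50.0
```

Neither scale shows the 83% figure mentioned earlier. Both simply count busy logical CPUs, not work actually delivered.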

And it gets worse still. The operating system reports the CPU time that a process uses based on the runnable time and the number of processes that were runnable at the time. But the performance during that time can vary by 20%. So, CPU time doesn't measure total cycles delivered very accurately or repeatably. Well, with demand paging, this has been true for a while anyway. Page replacement interrupts, TLB replacement interrupts, and even I/O interrupts all take their toll on accounting. So, IMO, it's not that much of a loss.

My new 4 core AMD Phenom II does not appear to support hyperthreading. I wish it did. But it still suffers a bit from poor accounting. My operating system tools sometimes report up to 400% CPU utilization, and sometimes report up to 100% CPU utilization.

And yet, there is a downside to hyperthreading. It has to do with priority. I often run a very long running background process, with the priority set as poor as possible. And Unix (or Linux) will typically give this process nearly 100% of the CPU when nothing else is running. And if there is a normal priority process running, then the background process will get 5%, with the foreground process getting 95%.

But with hyperthreading turned on, two processes may run at full speed because the operating system treats threads as CPUs. Since there are two CPUs (there aren't, really), the operating system lets them both run at full speed. That means that each process gets 60% of the single CPU speed. That's much less than 95% for the normal priority process, and much more than 5% for the low priority process. And there are times when i'm impatient enough to want that extra 35%. In fact, it's been a while, but at one time i ran a Unix variant that would give the normal priority process 100%, with no cycles at all going to the low priority process. I miss those days. They were nice, or is it not so nice?
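For reference, starting a background job at the poorest priority can be sketched as follows. This assumes a Unix-like system; `run_low_priority` is an invented helper, and the child command here is a placeholder that only reports its own niceness. As described above, niceness only arbitrates between processes competing for the same logical CPU, so it does not bring back the 95% / 5% split when hyperthreading is on.

```python
import os
import subprocess
import sys


def run_low_priority(args, increment=19):
    """Run a command after raising its niceness (Unix only)."""
    return subprocess.run(
        args,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Placeholder job: the child just prints its own niceness.
    result = run_low_priority(
        [sys.executable, "-c", "import os; print(os.nice(0))"]
    )
    print("child niceness:", result.stdout.strip())
```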


Anonymous said...

I know this post was made in 2010 but it is a very good overview of the confusion that is still around regarding hyperthreading technology. I searched for posts like this after a recent (January 2013) argument with a Best Buy salesman who tried to convince me that an HT CPU gives you a 100% CPU for all the virtual CPUs. After my initial attempt to explain that this is simply not true we ended up in an argument. I gave up on him since apparently any reasoning and technology explanations I gave him sort of failed to score a mark in his head. I left with a feeling that I had encountered an obscure cult church adept instead of a tech-savvy Best Buy employee. Anyway, thanks for the article.

Stephen said...

You're welcome.

It's true for many that once an idea gets into your head, it can be really hard to replace it with a better idea. It takes some effort to evaluate the new idea, and apparently even more effort to replace it. The good news is that it gets easier with practice. The hope is that progress can be made without the requirement of death.

Stephen said...

I recently (re)read the Linux "top" manual page. It talks about Solaris mode or Irix mode (neither of these Unix-like operating systems is Linux, but top is portable). The difference is that one reports a process using all cores of a 4 core processor as 100% and the other reports it as 400%. Hyperthreading makes this even more complicated, since a single core running more than one process (or thread) gets more total work done than the same core can if only one process is running on it.

Update: AMD's new Ryzen processors can do hyperthreading (AMD calls it simultaneous multithreading, or SMT).