High-Performance Computer Architecture 20 | VLIW and Explicitly Parallel Processors

Series: High-Performance Computer Architecture


VLIW Processor

(1) Recall: Superscalar Processor

A superscalar processor fetches more than one instruction at a time and, as far as possible, executes them simultaneously, which shortens the time needed to run the whole program.

(2) VLIW Processor

Unlike out-of-order superscalar processors, very long instruction word (VLIW) processors do not try to identify instruction-level parallelism (ILP) on their own. A VLIW processor also tries to do more than one instruction's worth of work per cycle, but it goes about it in a different way.

(3) Superscalar Processor Vs. VLIW Processor

Instructions Per Cycle

Both out-of-order and in-order superscalar processors try to execute up to N instructions per cycle (such a processor is called an N-issue processor). A VLIW processor is different: it executes only one large instruction per cycle, and this large instruction does the same work as N normal instructions.

Find independent instructions

The out-of-order superscalar looks at many more than N instructions in its instruction window in order to find up to N that can execute in each cycle. By contrast, the in-order superscalar looks only at the next N instructions in program order. The VLIW processor does not try to find independent instructions at all; it simply executes the next large instruction. In that sense it behaves like a non-superscalar processor that executes instructions in order.
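The difference in how the two superscalar designs search for independent instructions can be sketched with a toy issue model. This is a minimal sketch, not a description of any real processor: it assumes every instruction takes one cycle, it tracks only read-after-write dependencies (as if registers were renamed), and the names `producers`, `in_order_cycles`, `out_of_order_cycles`, and the two example programs are made up for illustration.

```python
def producers(program):
    """For each instruction, the indices of earlier instructions whose results it reads."""
    last_writer = {}
    deps = []
    for i, (dest, srcs) in enumerate(program):
        deps.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i
    return deps


def in_order_cycles(program, n):
    """Issue up to n instructions per cycle, strictly in program order."""
    deps = producers(program)
    done = set()  # instructions issued in earlier cycles
    pc = 0
    cycles = 0
    while pc < len(program):
        issued = []
        # Stop at the first instruction that still waits on a result.
        while len(issued) < n and pc < len(program) and deps[pc] <= done:
            issued.append(pc)
            pc += 1
        done.update(issued)
        cycles += 1
    return cycles


def out_of_order_cycles(program, n, window):
    """Issue up to n ready instructions per cycle from the oldest `window` unissued ones."""
    deps = producers(program)
    done = set()
    pending = list(range(len(program)))
    cycles = 0
    while pending:
        ready = [i for i in pending[:window] if deps[i] <= done][:n]
        done.update(ready)
        pending = [i for i in pending if i not in done]
        cycles += 1
    return cycles


# Each instruction is (destination register, [source registers]).
program = [
    ("r1", ["r0"]),
    ("r2", ["r1"]),  # depends on the previous instruction
    ("r3", ["r0"]),
    ("r4", ["r3"]),  # depends on the previous instruction
]

# The same work, reordered the way a scheduling compiler might arrange it.
scheduled = [
    ("r1", ["r0"]),
    ("r3", ["r0"]),
    ("r2", ["r1"]),
    ("r4", ["r3"]),
]

print(in_order_cycles(program, 2))
print(out_of_order_cycles(program, 2, window=4))
print(in_order_cycles(scheduled, 2))
```

Under these assumptions the 2-issue in-order model needs 3 cycles for `program`, the out-of-order model with a window of 4 needs 2, and the in-order model also needs only 2 once the instructions are reordered as in `scheduled`. A VLIW compiler would do that reordering ahead of time and emit each pair as one large instruction.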

Hardware cost

Because the out-of-order superscalar has to look at many more than N instructions to issue N per cycle, its hardware is expensive. The in-order superscalar only has to look at the next N instructions, so it is less expensive. The VLIW processor reduces the cost even further by looking only at the next large instruction, so it is cheaper still than an in-order superscalar.

Importance of the compiler’s support

For an out-of-order superscalar, the compiler can help improve program performance, but even without the compiler's help the processor can perform well. An in-order superscalar is more dependent on the compiler: if the compiler does nothing (such as instruction scheduling) to place independent instructions next to each other, an in-order superscalar will not perform well. Finally, a VLIW processor depends completely on the compiler for its performance. Without compiler support, a VLIW processor will fail miserably as far as performance is concerned.

(4) VLIW Evaluation

a. Benefits

the compiler does the hard work, so the cost of optimization (figuring out a good schedule) is paid only once, at compile time, rather than every time the program runs
can have simpler hardware compared with an out-of-order processor
can be energy efficient
works well on loops and regular code, such as sweeping through arrays or multiplying matrices

b. Downsides

latencies are not always the same (e.g. a cache miss), so the compiler cannot always know them when it builds the schedule
does not work well on irregular applications (e.g. AI applications, or applications that work with pointers in ways that are hard for a compiler to analyze)
code bloat: the code for a VLIW processor can be much larger than the code for a normal out-of-order (OOO) processor, because every slot that the compiler cannot fill with an independent operation has to be padded with a NOP
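The code bloat point can be made concrete with a small sketch of a compiler packing operations into fixed-width VLIW instructions. It is only an illustration under loud assumptions: unit latency for every operation, each register written at most once (so only read-after-write dependencies matter), a simple greedy packing rule, and a hypothetical function name `pack_bundles` and example operation list.

```python
def pack_bundles(ops, width):
    """Greedily pack ops into VLIW instructions of `width` slots, padding with 'nop'.

    ops is a list of (name, destination register, [source registers]).
    Assumes unit latency and that each register is written at most once.
    """
    last_writer = {}
    deps = []
    for i, (_, dest, srcs) in enumerate(ops):
        deps.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = i

    placed = set()  # ops placed in earlier bundles
    remaining = list(range(len(ops)))
    bundles = []
    while remaining:
        # An op may go in this bundle only if all of its producers are in earlier bundles.
        chosen = [i for i in remaining if deps[i] <= placed][:width]
        placed.update(chosen)
        remaining = [i for i in remaining if i not in placed]
        names = [ops[i][0] for i in chosen]
        bundles.append(names + ["nop"] * (width - len(names)))
    return bundles


ops = [
    ("load", "r1", ["r0"]),
    ("add", "r2", ["r1"]),  # needs the load
    ("mul", "r3", ["r2"]),  # needs the add
    ("sub", "r4", ["r0"]),  # independent of the chain
]

bundles = pack_bundles(ops, width=4)
nop_slots = sum(bundle.count("nop") for bundle in bundles)
for bundle in bundles:
    print(bundle)
print(nop_slots, "of", 4 * len(bundles), "slots are NOPs")
```

For this dependent chain, four useful operations end up occupying three 4-wide instructions, so 8 of the 12 encoded slots are NOPs. Regular loops with plenty of independent work fill the slots much better, which is why the benefits and downsides above fall where they do.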

(5) VLIW and Backward Compatibility

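Suppose we improve our VLIW design by building a wider processor, and we want it to stay backward compatible so that it can run code compiled for the older, narrower one. To benefit from the extra width, the new processor has to execute more than one of the old VLIW instructions per cycle. The compiler only guaranteed independence within each old instruction, not between consecutive ones, so the wider processor has to check the dependencies between VLIW instructions in hardware. At that point it is no longer really a VLIW processor; it looks like a normal superscalar processor.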
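The hardware check that backward compatibility forces on a wider processor can be sketched as follows. This is a simplified illustration, not a real design: `can_issue_together` is a hypothetical helper, each operation is modeled as a destination register plus source registers, and only read-after-write and write-after-write conflicts between the two old instructions are checked.

```python
def can_issue_together(first, second):
    """Return True if two consecutive old VLIW instructions may execute in the same cycle.

    Each instruction is a list of (destination register, [source registers]) operations.
    Simplified check: `second` must not read or overwrite anything `first` writes.
    """
    written = {dest for dest, _ in first}
    for dest, srcs in second:
        if written & set(srcs):  # read-after-write across the two instructions
            return False
        if dest in written:  # write-after-write across the two instructions
            return False
    return True


# Three 2-wide instructions compiled for the old, narrower processor.
old_a = [("r1", ["r0"]), ("r2", ["r0"])]
old_b = [("r3", ["r1"]), ("r4", ["r0"])]  # reads r1, which old_a writes
old_c = [("r5", ["r0"]), ("r6", ["r0"])]  # independent of old_a

print(can_issue_together(old_a, old_b))
print(can_issue_together(old_a, old_c))
```

The comparison of every operation in one instruction against every operation in the next is exactly the kind of dependence-checking logic that a superscalar processor has and that a pure VLIW design was meant to avoid.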

(6) VLIW Instructions

VLIW instructions have all the usual ISA opcodes

The instruction set of a VLIW processor typically has all the normal ISA opcodes, so each slot of a VLIW instruction can typically do whatever a normal instruction can do.

Fully support predication

VLIW relies on the compiler to expose parallelism, and one of the main ways the compiler does that is by scheduling instructions. If the compiler can use predication to replace branches with predicated instructions, it has more opportunities for instruction scheduling, because it can schedule across what used to be branch boundaries.
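To illustrate why predication helps scheduling, the sketch below models an if/else that has been converted into predicated operations. It is a toy model with made-up names (`run_predicated`, the register names, the operation tuples): every operation executes, but its result is written back only when its predicate register is true, which stands in for the write-enable that predicated hardware would provide.

```python
import operator


def run_predicated(ops, regs):
    """Execute straight-line predicated ops; an op commits only if its predicate is true.

    Each op is (predicate register or None, destination, function, [source registers]).
    """
    regs = dict(regs)
    for pred, dest, fn, srcs in ops:
        result = fn(*(regs[s] for s in srcs))
        if pred is None or regs[pred]:  # models the hardware write-enable
            regs[dest] = result
    return regs


# Branchy original:   if cond: x = a + b   else: x = a - b
# If-converted form: no branch, two predicated ops guarded by p and np.
ops = [
    (None, "p", lambda c: c != 0, ["cond"]),
    (None, "np", lambda c: c == 0, ["cond"]),
    ("p", "x", operator.add, ["a", "b"]),
    ("np", "x", operator.sub, ["a", "b"]),
]

print(run_predicated(ops, {"a": 5, "b": 3, "cond": 1})["x"])
print(run_predicated(ops, {"a": 5, "b": 3, "cond": 0})["x"])
```

Because the add and the subtract no longer sit on opposite sides of a branch, and their predicates are mutually exclusive, a compiler is free to place both in the same large instruction.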

Require many registers because of the scheduling optimizations
Branch hints because the compiler needs to tell the hardware its predictions
VLIW instruction compaction

Instead of filling empty instruction slots with NOPs, the encoding uses stops: markers that indicate where a group of operations that can execute in parallel ends. Empty slots then no longer have to be encoded, which reduces code bloat.
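The idea behind compaction can be sketched with a pair of helper functions that convert between NOP-padded instructions and a stream of operations carrying a stop flag. This is an assumption-laden illustration rather than any real encoding: `compact` and `expand` are hypothetical names, any slot is assumed able to hold any operation, and every instruction is assumed to contain at least one real operation.

```python
def compact(bundles):
    """Drop NOPs and tag the last real op of each bundle with a stop flag."""
    stream = []
    for bundle in bundles:
        real = [op for op in bundle if op != "nop"]
        for k, op in enumerate(real):
            stream.append((op, k == len(real) - 1))
    return stream


def expand(stream, width):
    """Rebuild fixed-width bundles from the compact stream, as decode hardware might."""
    bundles = []
    current = []
    for op, stop in stream:
        current.append(op)
        if stop:  # the stop flag ends the group that executes together
            bundles.append(current + ["nop"] * (width - len(current)))
            current = []
    return bundles


padded = [
    ["load", "sub", "nop", "nop"],
    ["add", "nop", "nop", "nop"],
    ["mul", "nop", "nop", "nop"],
]

stream = compact(padded)
print(stream)
print(len(stream), "encoded ops instead of", sum(len(b) for b in padded), "slots")
```

In this example the 12 padded slots shrink to 4 encoded operations, and `expand(stream, 4)` recovers the original grouping, so no scheduling information is lost.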

(7) VLIW: Examples

Itanium

Intel’s Itanium is the typical example of a VLIW-style processor, and it has tons of ISA features. As a result, its hardware became very complicated, even though it still does not need to check for dependencies between instructions.

(Digital Signal Processing) DSP Processors

DSP processors do a lot of floating-point work, typically in very regular loops with many iterations, where each iteration does only a small amount of work. On this kind of code, VLIW processors get excellent performance and are very energy efficient, because they do not spend much power on figuring out dependencies. So in this case, VLIW is a good choice.