Inside a Vector CPU
From Cray Wiki
Just what was it that made the Cray CPUs so fast? Putting aside the fact that the logic was implemented in fast bipolar hardware, there were a number of features that, combined with clever compiler technology, made the processors speed through the type of scientific and engineering problems that were the heartland of Cray customers. Described in this section are some to the features that made the difference in both speed and price.
Registers: lots of them, in a YMP CPU for example
- 8 V registers
- each 64 words long
- each word 64 bits
- 64 T registers
- each 64 bits
- 8 S registers
- each 64 bits
- 64 B registers
- each 32 bits
- 8 A registers
- each 32 bits
- 4 Instructions buffers 32 64-bit words (128 16-bit parcels)
YMP functional units were:
- address: add multiply
- scalar: add, shift, logical, pop/parity/leading zero
- vector: add, shift, logical, pop/parity
- floating: add, multiply, reciprocal approximation
Other sundry CPU registers are Vector mask, Vector length, Instruction issue registers, performance monitors, programmable clock, Status bits and finally exchange parameters and I/O control registers. The quantity and the types of registers evolved and expanded through the life of the CPU types. The C90 added more functional units to the YMP design and the T90 even more still.
Pipelining: This technique breaks down the stages required for an arithmetic operation into a serial line of sequential stages. As each number passed through a pipeline stage the next calculation enters. By the time that the first number exits from the pipeline, at 20 clock cycles for a 10 stage pipeline, the second number will be just two cycles behind. Measuring the time taken to process 10 numbers, the total would work out at 20, for the first number + 2 * 9 for the rest totalling 38 cycles against 10 * 20 = 200 cycles for the same operation if pipelining was not employed. This technique is especially effective when combined with Vector operations.
Vector operations: The CPUs have vector registers that could hold an array of numbers. The instruction set of the CPU had operations such as set VL=60 ; Vc = Va + Vb; which would decode as, take the first 60 numbers from vector register a, add them to the corresponding numbers in vector register b and put the answers in vector register C. This operation corresponds to the common FORTRAN construct:
do i = 1,60 c[i] = a[i] + b[i]done
This and other similar constructs executed at phenomenal speed on the Cray architecture making it eminently suitable for scientific and engineering processing. The compiler technology was tuned over the years to detect many source code constructs and generate the fastest machine code implementations. Vectorization speed ups can be detected in vectors as short as 3 or 4 numbers.
Functional units: The CPUs had independent function units so that parallel operations could be active between unrelated registers at the same time. The code:
do i = 1,60 c[i] = a[i] + b[i] d[i] = e[i] * f[i]done
would complete with a few cycles of the code fragment above. The 64 bit functional units in a YMP were firstly an Add/subtract/Logical vector unit. Next a Floating point Add/subtract/multiply/reciprocal unit that could be either vector or scalar. Next a scalar add/subtract logical/shift/pop/leading zero unit. Finally the 32 bit address functional unit was for add/subtract/multiply on 32 bit address operands.
Daisy-chaining: This technique, used for more complex vector operations, could exploit the combined independent functional units in a daisy-chain allowing operations such as :
do i = 1,60 c[i] = a[i] + b[i] * d[i]done
to complete within a few cycles of the code fragments above. A similar type of process known as tailgating was used in Cray-2 CPUs.
Gather/scatter/conditional vector: These special operations, used in sparse matrix operations, applied one vector register as the table index to another. A gather operation could appear in FORTRAN as:
do i = 1,60 c[i] = a[ b[i] ]done
A conditional vector operation might appear in src code as
do i = 1,60 if ( b[i] gt 0 )then c[i] = a[ i ] fidone
These special operations executed at peak speed for the processor making it possible to achieve near peak speeds on real life codes. A feat that seems to have been forgotten in the modern days where processors often fail to achieve even 10% of their theoretical peak speed on non-trivial real life codes.
Memory interface: CPUs are faster than memory so the speed at which a processor can exchange information with memory limits the effectiveness of the processor. This can strangle the performance of an architecture so a simple solution to halve the memory delay is to have two independent banks of memory. Taking this further, having enough memory banks to match the ratio of memory speed to CPUs speed, would remove the memory refresh speed delay. For example if a CPU has an 8.5 Nano second clock cycle time and the memory banks have a refresh time of 68 Nano seconds and there are 16 memory banks an operation such as
do i = 1,60000 c[i] = a[i] + ndone
can run at full speed. Even in modern processors the above operation would become memory bound as soon as the processors cache was exhausted. As well as multiple banks there were multiple ports to memory from each CPU to prevent bus contention. Looking at this from another view, sequential memory locations come from separate memory banks. As the architecture developed the number of banks and ports increased along with vector length.
Location, 0,1,2,3,4,5,6,7,8,9,A,B,C,... bank, 0,1,2,3,4,5,6,7,0,1,2,3,4,... (It was not quite as simple as this in the hardware, but you get the idea.)
This memory bank architecture also accounted for machines with identical CPUs, but different memory sizes having different peak speeds. It also explained why a memory board failure could not be worked around by shrinking memory. In the above example removing a memory board would remove every 8th memory location, which is impossible to code round. C90 systems had the ability to map spare memory chips to cover failing memory locations. Later T90s did have the ability to down half memory or some CPUs in the event of a failure.
There is no data cache in a vector CPU, except the registers, but the instruction buffers acted as operation caches. Later J90s did have a small data cache. There was no hardware virtual memory support except partially in T90s. See also SSD.
