Cray 2

From Cray Wiki




The Cray-2 was a vector supercomputer made by Cray Research starting in 1985. It was the fastest machine in the world when it was released, replacing Cray's own X-MP in that spot. The Cray-2 was capable of 1.9 GFLOPS peak performance and held the top spot until it was displaced by the ETA-10G in 1990.

With the successful launch of his famed Cray-1, Seymour Cray turned to the design of its successor. By 1979 he had become fed up with management interruptions in what was now a large company, and as he had done in the past, decided to resign his management post and move to form a new lab. As with his original move to Chippewa Falls, Wisconsin from Control Data HQ in Minneapolis, MN, Cray management understood his needs and supported his move to a new lab in Boulder, Colorado. Working as an independent consultant at these new Cray Labs, he put together a team and started on a completely new design. This Lab would later close, and a decade later a new facility in Colorado Springs would open.

Cray had previously attacked the problem of increased speed with three simultaneous advances: more functional units to give the system higher parallelism, tighter packaging to decrease signal delays, and faster components to allow for a higher clock speed. The classic example of this design is the CDC 8600, which packed four CDC 7600-like machines based on ECL logic into a 1 x 1 meter cylinder and ran them at an 8 ns cycle time (125 MHz). Unfortunately, the density needed to achieve this cycle time led to the machine's downfall. The circuit boards inside were densely packed, and because even a single malfunctioning transistor would cause an entire module to fail, packing more of them onto the cards greatly increased the odds of failure.

One solution to this problem, one that most computer vendors had already moved to, was to use integrated circuits (ICs) instead of individual components. Each IC included a selection of components from a module, pre-wired into a circuit by the automated construction process. If an IC didn't work, you simply threw it away and tried another. At the time the 8600 was being designed, the simple MOSFET-based technology didn't offer the speed Cray needed. Relentless improvements changed things by the mid-1970s, however, and the Cray-1 had been able to use newer ICs and still run at a respectable 12.5 ns (80 MHz). In fact, the Cray-1 was actually somewhat faster than the 8600 because the ICs' small size let it pack considerably more logic into the system.

Although IC design continued to improve, the physical size of the ICs was constrained largely by mechanical limits; the resulting component had to be large enough to solder into a system. Dramatic improvements in density were possible, as the rapid improvement in microprocessor design was showing, but for the type of ICs used by Cray, ones representing a very small part of a complete circuit, the design had plateaued. In order to gain another 10-fold increase in performance over the Cray-1, the goal Cray aimed for, the machine would have to grow more complex. So once again he turned to an 8600-like solution, doubling the clock speed through increased density, adding more of these smaller processors into the basic system, and then attempting to deal with the problem of getting heat out of the machine.

Another design problem was the increasing performance gap between the processor and main memory. In the era of the CDC 6600 memory ran at the same speed as the processor, and the main problem was feeding data into it. Cray solved this by adding ten smaller computers to the system, allowing them to deal with the slower external storage (disks and tapes) and "squirt" data into memory when the main processor was busy. This solution no longer offered any advantages; memory was large enough that entire data sets could be read into it, but the processors ran so much faster than memory that they would often spend long times waiting for data to arrive. Adding four processors simply made this problem worse.

To avoid this problem, the new design banked main memory and replaced two sets of registers (the B- and T-registers) with a 16 Kword block of the very fastest memory possible, called Local Memory. This was not a cache. The four background processors were attached to it with separate high-speed pipes. Local Memory was fed data by a dedicated foreground processor, which was in turn attached to main memory through a Gbit/s channel per CPU. By contrast, X-MPs had three channels (for two simultaneous loads and a store), and Y-MP/C-90s had five, to avoid the von Neumann bottleneck. It was the foreground processor's task to "run" the computer, handling storage and making efficient use of the multiple channels into main memory. It drove the background processors by passing in the instructions they should run via eight 16-word buffers, instead of tying up the existing cache pipes to the background processors. Modern CPUs use a variation of this design as well, although the foreground processor is now referred to as the load/store unit and is not a complete machine unto itself.

Main memory banks were arranged in quadrants to be accessed at the same time, allowing programmers to scatter their data across memory to gain higher parallelism. The downside to this approach is that the cost of setting up the scatter/gather unit in the foreground processor was fairly high. Accesses whose stride matched the number of memory banks conflicted and suffered a performance (latency) penalty, as occasionally happened in power-of-2 FFT-based algorithms. As the Cray-2 had a much larger memory than Cray-1s or X-MPs, this problem was easily rectified by adding an extra unused element to an array to spread the work out.
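The padding trick can be illustrated with a small model. This is only a sketch under stated assumptions: the bank count of 128 and the simple "address modulo bank count" mapping are chosen for illustration and are not taken from this article, and the function name is hypothetical.

```python
NUM_BANKS = 128  # hypothetical bank count, chosen only for illustration


def banks_touched(base, stride, count, num_banks=NUM_BANKS):
    """Count the distinct banks hit by `count` accesses starting at `base`.

    Assumes word address `a` lives in bank `a % num_banks` (simple interleaving).
    """
    return len({(base + i * stride) % num_banks for i in range(count)})


ELEMENTS = 64  # one vector register's worth of accesses

# Walking down a column of a row-major array uses a stride equal to the row length.
conflicting = banks_touched(0, 1024, ELEMENTS)   # row length is a multiple of NUM_BANKS
padded = banks_touched(0, 1024 + 1, ELEMENTS)    # one unused element added per row

print(conflicting, padded)  # prints: 1 64
```

With a row length of 1024 every access lands in the same bank, so the accesses serialize. Adding one unused element per row spreads the same 64 accesses over 64 different banks in this model.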

Packed circuit boards and new design ideas

Cray-2 models soon settled on a design using large circuit boards packed with ICs. This made them extremely difficult to solder together, and the density was still not enough to reach their performance goals. Teams worked on the design for about two years before even Cray himself "gave up" and decided it would be best if they simply canceled the project and fired everyone working on it. Les Davis, Cray's former design collaborator who had remained at Cray headquarters, decided it should be continued at low priority. After some minor personnel movements the team continued on much as before.

Six months later Cray had his "eureka" moment. He called the main engineers together for a meeting and presented a new solution to the problem. Instead of making one larger circuit board, each "card" would instead consist of a 3-D stack of eight, connected together in the middle of the boards using pins sticking up from the surface (known as "pogos" or "z-pins"). The cards were packed right on top of each other, so the resulting stack was only about 3 inches high. With this sort of density there was no way any conventional air-cooled system would work; there was too little room for air to flow between the ICs. Instead the system would be immersed in a tank of a new inert liquid from 3M, Fluorinert. The cooling liquid was forced sideways through the modules under pressure. The heated liquid was cooled using chilled water heat exchangers and returned to the main tank. Work on the new design started in earnest in 1982, several years after the original start date.

While this was going on, the Cray X-MP was being developed under the direction of Steve Chen at Cray headquarters, and looked like it would give the Cray-2 a serious run for its money. To address this internal threat, as well as a series of newer Japanese Cray-1-like machines, the Cray-2 memory system was dramatically improved, both in size and in the number of "pipes" into the processors. By the time the machine was delivered in 1985, the delays had been so long that much of its performance benefit came from the fast memory, and the machine only really made sense to purchase for users with huge data sets to process.

The first Cray-2 delivered possessed more physical memory (256 MWord) than all previously delivered Cray machines combined. Simulation moved from a 2-D realm or coarse 3-D to a finer 3-D realm because computation did not have to rely on slow virtual memory. This inability to trade space (memory) for time (speed) is what defines supercomputation (extreme, high-end computing).

Uses and Successors

The Cray-2 was predominantly developed for the United States Departments of Defense and Energy. Uses tended to be for nuclear weapons research or oceanographic (sonar) development. However, the Cray-2 also found its way into civil agencies (such as NASA Ames Research Center), universities, and corporations worldwide.

The Cray-2 would have been superseded by the Cray-3, but due to development problems only a single Cray-3 was built and it was never paid for. The spiritual descendant of the Cray-2 is the Cray X1, offered by Cray.


Due to the use of liquid cooling, the Cray-2 was given the nickname "Bubbles", and common jokes around the computer made reference to this unique system. Gags included "No Fishing" signs, cardboard depictions of the Loch Ness Monster rising out of the heat exchanger tank, plastic fish inside the exchanger, etc.


Total-immersion-cooled vector supercomputer. Featured very large memories and a compact size.

Background processors

2 2 4 4 4

Memory Mwords



Cray 2 and Cray 3 Instruction set Notes

Cray-2 and Cray-3

The Cray-2 computer system was introduced by Cray Research in 1985. It was a shared-memory multiprocessor from the start, shipping initially with 4 high-performance vector "background processors", one simple "foreground processor" for I/O and system controller code, and a very large shared memory. Only the background processor instruction set is described here.

The background processor's programming model comprises a Program Counter and:

  • 8 32-bit address (A) registers
  • 8 64-bit scalar (S) registers
  • 8 64-bit vector (V) registers, each with 64 elements
  • 1 64-bit vector mask (VM) register
  • 1 6-bit vector length (VL) register
  • 16K 64-bit Local Memory (LM) words
  • 1 1-bit Semaphore shared with other processors

Memory is addressed, like the Cray-1, in units of 64-bit words for data references and in units of 16-bit instruction parcels for instruction fetches. Instruction parcels are mapped in big-endian order into their words and denoted by an octal word address suffixed with a letter 'a'-'d'.
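As a rough illustration of the parcel addressing just described, the sketch below converts a flat parcel index into the "octal word address plus letter" form and extracts a parcel from a 64-bit word. The function names and the flat parcel index are hypothetical, and the exact textual format (for example, any zero padding) is an assumption.

```python
def parcel_label(parcel_index):
    """Format a flat parcel index as an octal word address plus 'a'..'d'."""
    word, slot = divmod(parcel_index, 4)  # four 16-bit parcels per 64-bit word
    return f"{word:o}{'abcd'[slot]}"


def parcel_value(word, slot):
    """Return parcel `slot` (0 = 'a') of a 64-bit word.

    Big-endian mapping: parcel 'a' is the most significant 16 bits.
    """
    shift = 16 * (3 - slot)
    return (word >> shift) & 0xFFFF


print(parcel_label(35))                              # word 8 (octal 10), parcel 'd'
print(hex(parcel_value(0x0123456789ABCDEF, 0)))      # most significant parcel
```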

No register has special values, unlike B0 on the CDC 6600 and A0 or S0 on the Cray-1. Any A or S register can be tested in a conditional jump.

Only S and V registers can be loaded from or stored to real memory; the A registers have to be copied through the S registers. There is no direct path between Local Memory and real memory either, so vector registers must be used to implement block copies.
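A block copy through the vector registers can be modeled as moving the data in strips of at most 64 words, the capacity of one V register. The following is a minimal simulation in Python, not Cray assembly; the function name and the list-based memories are hypothetical stand-ins.

```python
VECTOR_REGISTER_ELEMENTS = 64  # elements per V register, per the register list above


def copy_local_to_main(local_mem, lm_start, main_mem, mm_start, count):
    """Copy `count` words from Local Memory to main memory via a V register.

    Returns the number of strips (vector load/store pairs) that were needed.
    Assumes both ranges lie inside their respective lists.
    """
    done = 0
    strips = 0
    while done < count:
        vl = min(VECTOR_REGISTER_ELEMENTS, count - done)          # vector length for this strip
        v = local_mem[lm_start + done:lm_start + done + vl]       # Local Memory -> V register
        main_mem[mm_start + done:mm_start + done + vl] = v        # V register -> main memory
        done += vl
        strips += 1
    return strips


local = list(range(200))
main = [0] * 300
print(copy_local_to_main(local, 0, main, 50, 150))  # 150 words need three strips: 64 + 64 + 22
```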

  • octal assembly description
  • 000x00 ERR error exit
  • 000xjk EXIT jk normal exit (system call to foreground)
  • 001000 CMR wait until memory port quiet
  • 002i-k R,Ai Ak return jump to Ak, return address to Ai
  • 003---mn J mn unconditional jump
  • 004---mn JCS mn jump if semaphore clear (and set it)
  • 005---mn JSS mn jump if semaphore set (else set it)
  • 006--- SSM set semaphore
  • 007--- CSM clear semaphore
  • 010--kmn JZ Ak,mn jump if Ak zero
  • 011--kmn JN Ak,mn jump if Ak nonzero
  • 012--kmn JP Ak,mn jump if Ak >= 0 (sign bit clear)
  • 013--kmn JM Ak,mn jump if Ak < 0 (sign bit set)
  • 014-j-mn JZ Sj,mn jump if Sj zero
  • 015-j-mn JN Sj,mn jump if Sj nonzero
  • 016-j-mn JP Sj,mn jump if Sj >= 0 (sign bit clear)
  • 017-j-mn JM Sj,mn jump if Sj < 0 (sign bit set)
  • 020ijk Ai Aj+Ak address add
  • 021ijk Ai Aj-Ak address subtract
  • 022ijk Ai Aj*Ak address multiply
  • 024ij- Ai Sj transfer Sj to Ai
  • 025i-- Ai VL read vector length (mod 64)
  • 026ijk Ai jk,S,P immediate load (6 bits, zero filled)
  • 027ijk Ai jk,S,M immediate load (6 bits, one filled)
  • 030--k VM Vk,Z test Vk for zero elements
  • 031--k VM Vk,N test Vk for nonzero elements
  • 032--k VM Vk,P test Vk for elements >= 0 (sign clear)
  • 033--k VM Vk,M test Vk for negative elements (sign set)
  • 034-j- VM Sj copy Sj to vector mask
  • 035--0 DRI disable memory addressing error interrupt
  • 035--1 ERI enable memory addressing error interrupt
  • 035--2 DFI disable floating-point interrupt
  • 035--3 EFI enable floating-point interrupt
  • 036--k VL Ak set vector length [note!!]
  • 040i--m Ai m,P,P immediate load (16 bits, zero filled)
  • 041i--m Ai m,P,M immediate load (16 bits, one filled)
  • 042i--mn Ai mn,H immediate load (32 bits)
  • 044i--m Ai [m] load Ai from Local Memory (direct)
  • 045--km [m] Ak store Ak to Local Memory (direct)
  • 046i-k Ai [Ak] load Ai from Local Memory (indexed)
  • 047-jk [Ak] Aj store Aj to Local Memory (indexed)
  • 050i--mn Si mn,H,P immediate load (32 bits, zero filled)
  • 051i--mn Si mn,H,M immediate load (32 bits, one filled)
  • 052i--mn Si mn,L immediate load upper 32 bits (zero fill)
  • 053i--mnop Si mnop,F immediate load 64 bits
  • 054i--m Si [m] load Si from Local Memory (direct)
  • 055-j-m [m] Sj store Sj to Local Memory (direct)
  • 056i-k Si [Ak] load Si from Local Memory (indexed)
  • 057i-k [Ak] Si store Si to Local Memory (indexed)
  • 060ijk Si (Aj,Ak) load Si
  • 061ijk (Aj,Ak) Si store Si
  • 062i-k Si (Ak) load Si
  • 063i-k (Ak) Si store Si
  • 064i-kmn Si (Ak,mn) load Si
  • 065i-kmn (Ak,mn) Si store Si
  • 066i--mn Si (mn) load Si
  • 067i--mn (mn) Si store Si
  • 070ijk Vi (Aj,Ak) load Vi from Aj, stride Ak
  • 071ijk (Aj,Ak) Vi store Vi to Aj, stride Ak
  • 072ijk Vi (Ak,Vj) gather Vi from Ak, offsets Vj
  • 073ijk (Ak,Vj) Vi scatter Vi to Ak, offsets Vj
  • 074i-k Vi [Ak] load Vi from Local Memory, stride 1 only
  • 075i-k [Ak] Vi store Vi to Local Memory, stride 1 only
  • 076--- PASS canonical no-op
  • 100ijk Si Sj&Sk AND
  • 101ijk Si #Sk&Sj AND with complement
  • 102ijk Si Sj\Sk XOR
  • 103ijk Si Sj!Sk OR
  • 104ijk Si Sj+Sk integer add
  • 105ijk Si Sj-Sk integer subtract
  • 106ij0 Si PSj population count
  • 106ij1 Si QSj parity (low bit of pop count)
  • 107ij- Si ZSj leading zero count
  • 110ijk Si Si<jk left shift
  • 111ijk Si Si>jk logical right shift
  • 112ijk Si Si,Sj<Ak left shift with fill from Sj
  • 113ijk Si Sj,Si>Ak right shift with fill from Sj
  • 114i-- Si VM read vector mask
  • 115i-- Si RT read real-time clock
  • 116ijk Si jk,S,P immediate load (6 bits, zero fill)
  • 117ijk Si jk,S,M immediate load (6 bits, one fill)
  • 120ijk Si Sj+FSk floating add
  • 121ijk Si Sj-FSk floating subtract
  • 122ijk Si FIX,Sk convert floating to integer
  • 123ijk Si FLT,Sk convert integer to floating
  • 124ijk Si Sj*FSk floating multiply
  • 126ijk Si Sj*ISk reciprocal iteration (2-Sj*Sk)
  • 127ijk Si Sj*QSk recip square root iteration (3-Sj*Sk)/2
  • 130i-k Si Ak transfer Ak to Si, zero fill
  • 131i-k Si +Ak transfer Ak to Si, sign extended
  • 132ij- Si /HSj reciprocal approximation
  • 133ij- Si *QSj reciprocal square root approximation
  • 140ijk Vi Sj&Vk AND
  • 141ijk Vi Vj&Vk
  • 142ijk Vi Sj\Vk XOR
  • 143ijk Vi Vj\Vk
  • 144ijk Vi Sj!Vk OR
  • 145ijk Vi Vj!Vk
  • 146ijk Vi Sj!Vk&VM merge Sj (where VM set) with Vk (where clear)
  • 147ijk Vi Vj!Vk&VM merge Vj (where VM set) with Vk (where clear)
  • 150ijk Vi Vj<Ak left shift
  • 151ijk Vi Vj>Ak logical right shift
  • 152ijk Vi Vj,Vj<Ak continuous left shift (fill from next element)
  • 153ijk Vi Vj,Vj>Ak continuous right shift (fill from prior element)
  • 154ijk Vi Sj*FVk floating multiply
  • 155ijk Vi Vj*FVk
  • 156ijk Vi Vj*IVk reciprocal iteration (2-Vj*Vk)
  • 157ijk Vi Vj*QVk recip square root iteration (3-Vj*Vk)/2
  • 160ijk Vi Sj+Vk integer add
  • 161ijk Vi Vj+Vk
  • 162ijk Vi Sj-Vk integer subtract
  • 163ijk Vi Vj-Vk
  • 164ij0 Vi PVj population count
  • 164ij1 Vi QVj parity (low bit of pop count)
  • 165ij- Vi ZVj leading zero count
  • 166i-k Vi /HVk reciprocal approximation
  • 167i-k Vi *QVk reciprocal square root approximation
  • 170ijk Vi Sj+FVk floating add
  • 171ijk Vi Vj+FVk
  • 172ijk Vi Sj-FVk floating subtract
  • 173ijk Vi Vj-FVk
  • 174i-k Vi FIX,Vk convert floating to integer
  • 175i-k Vi FLT,Vk convert integer to floating
  • 176ijk Vi CI,Sj&Sk compressed index from mask Sj, 32-bit stride Sk
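The reciprocal and reciprocal square root entries in the table read like the pieces of a Newton-Raphson refinement: an approximation instruction supplies a starting guess, and each iteration instruction produces a correction factor that an ordinary floating multiply then applies. The sketch below shows that arithmetic in Python. How the operands are actually arranged on the machine is an assumption here, and the function names and starting guesses are hypothetical.

```python
def recip_iteration(sj, sk):
    """Correction factor from the table: 2 - Sj*Sk."""
    return 2.0 - sj * sk


def rsqrt_iteration(sj, sk):
    """Correction factor from the table: (3 - Sj*Sk)/2."""
    return (3.0 - sj * sk) / 2.0


def reciprocal(a, x0, steps=4):
    """Refine x0 (a rough guess at 1/a) with Newton-Raphson steps."""
    x = x0
    for _ in range(steps):
        x = x * recip_iteration(a, x)        # x <- x * (2 - a*x)
    return x


def reciprocal_sqrt(a, x0, steps=4):
    """Refine x0 (a rough guess at 1/sqrt(a)) with Newton-Raphson steps."""
    x = x0
    for _ in range(steps):
        x = x * rsqrt_iteration(a * x, x)    # x <- x * (3 - a*x*x) / 2
    return x


print(reciprocal(3.0, 0.3))        # close to 1/3
print(reciprocal_sqrt(2.0, 0.7))   # close to 1/sqrt(2)
```

Each step roughly squares the relative error, so a modest starting guess reaches full double precision in a handful of iterations. A division a/b would then be computed as a multiplied by the refined reciprocal of b.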

The Cray-3 was a system developed but never successfully produced by Cray Computer Corporation (1989-1995). Its background and foreground processor instruction sets were nearly identical to those of the Cray-2, apart from gratuitous differences in "j"/"k" field usage in some monadic instructions, different floating-point rounding behavior, and the addition of these "bidirectional" vector memory reference instructions:

  • 134ijk Vi <Aj,Ak> load Vi from Aj, stride Ak
  • 135ijk <Aj,Ak> Vi store Vi to Aj, stride Ak
  • 136ijk Vi <Ak,Vj> gather Vi from Ak, offsets Vj
  • 137ijk <Ak,Vj> Vi scatter Vi to Ak, offsets Vj

These have nearly the same semantics as the 070-073 instructions, but with a twist: a 134 or 136 load instruction can run simultaneously with a 135 or 137 store instruction. It is up to the programmer or compiler to guarantee that the parallel address streams are distinct. An S register load or store, a normal 070-073 vector reference, or a CMR instruction serves as a memory barrier.

Cray-2 assembly language code would have assembled and run on the Cray-3 without change (had the machine worked!) but would have produced slightly different floating-point results.

[ pmk - summarize difficulties of Local Memory ??!! ]
