
Josep Torrellas
A Faulty Parallel Computer for Everyone
Now that all the processor companies, Intel, IBM, AMD, and Sun among them, are marketing chips with multiple processors (Chip Multiprocessors, or Multicores), Josep Torrellas is seeing how his work and that of his colleagues is being applied. The challenges are less about performance, which was the mantra a decade ago, and more about hardware reliability and ease of programming.
Making a Silk Purse from a Sow's Ear
As we demand more and more from the transistors, the materials with which we build Multicore chips, they become increasingly unreliable. A chip's transistors will come in a variety of speeds and power consumption levels. As a result, some sections of the chip will be faster than others, and some will be hotter than others. Nevertheless, users expect Multicore chips to work smoothly and reliably. For computer architects like Torrellas, building reliable computers on top of flaky transistors will be the great challenge of the next decade. The problem is like getting a sow's ear from the circuit fabrication team and selling it as a silk purse to computer users. Users expect the smooth texture of silk from a material that is rough and ugly.
"The key idea is to accept the fact that transistor variability exists and that one cannot design the Multicore for the worst-case transistor," Torrellas said. "Doing this-designing so that all transistors behave like the slowest of transistors-would be a waste of potential." Instead, in his VATS project, Torrellas proposes designing for the average transistor. The result will be a bunch of transistors on the chip that will never work, while others will cease to work temporarily or permanently at a certain point. The job of the computer architect is to add vast amounts of transistors that perform redundant operations and that continuously check for faults and disable the bad actors. In a sense, we are buying more of a faulty commodity to be able to get the computer to work. Multicore chips, therefore, will have plenty of "useless" real estate, which does not contribute to the program's performance, and it might even reduce it sometimes, but at least it ensures that the system runs smoothly.
Who Has the Time to Verify That Your Processor Works?
Designing Multicores is a complicated business. Their minute hardware parts all have to follow complicated protocols to interface with each other and with the outside world. Existing techniques to verify that a processor has no design defect rely on running exhaustive (and exhausting) tests before shipment. Every time a malfunction is found (e.g., some buffer is amiss and produces the wrong result), many of the company's resources are devoted to the painstaking task of trying to reproduce the bug and fix it.
"We need to simplify the task of verifying the hardware," said Torrellas, "and accept that some design defects will escape to the field." Torrellas designed hardware extensions called CADRE that enable the Multicore and its entire computer board (memory controller, input/output interfaces) to be fully cycle-deterministic. This means that if you put the same input signals to the board and start it in the same state, it will exactly reproduce what happened, cycle by cycle. This enables designers to re-run a test that uncovered a design defect and reproduce it thereby simplifying their debugging task enormously. Moreover, Torrellas has designed a hardware module called Phoenix for Multicore chips that can be programmed in the field to fix design defects as they are found. "The process," said Torrellas, "is analogous to operating system software patches but for the chip's hardware." When the company discovers a new design defect in its chip, it would broadcast a "hardware-patch" that users could download from the web. The patch would automatically reprogram the Phoenix hardware and fix that design defect for good.
Programming is Child's Play
None of the advances in Multicore hardware will matter if programmers find it too difficult to program parallel processors. In the past, parallel programming was limited to a small population. With the arrival of Multicores, many more people will be able to leverage parallel processing. Torrellas has made it his mission to design architectures that help simplify the task of writing parallel programs.
"The architecture should off-load some of the tasks that
make parallel programming complex," said Torrellas, "and also
help the programmer pinpoint problems in the software." The main
challenge for the programmer is to make sure that all concurrent
operations in a program (the "threads"), can execute at the same
time without interfering with each other-not an easy task. In one
of Torrellas's schemes called Colorama, the programmer declares
the data structures that should be accessed by only one thread at
a time, and the hardware automatically "protects" them-blocking
accesses if more than one thread wants to access them
concurrently. This relieves the programmer from having to encode
such protection in the program. If, instead, the programmer
chooses to encode all the actions in the program, special
hardware called ReEnact checks in the background for software
errors as the program executes. Such errors typically manifest as
data races, a bug resulting from unexpected interleaving of
operations from different threads that try to access the same
shared variable. "We provide support to identify data races in
programs and may even fix some of them on the fly," said
Torrellas. "The promise of the hardware is that it can perform
checks in the background with very low overhead."
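The "declare it, and let the system protect it" idea can be sketched in software. The example below is only an analogy, not Colorama, which is a hardware scheme: a hypothetical `Protected` wrapper marks an object as one-thread-at-a-time, and every method call on it is guarded by a lock automatically, so the code that uses the object contains no explicit locking.

```python
import threading


class Protected:
    """Wrap an object so only one thread at a time can run its methods."""

    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.RLock()

    def __getattr__(self, name):
        # Called only for names not found on the wrapper itself.
        attr = getattr(self._obj, name)
        if not callable(attr):
            return attr

        def guarded(*args, **kwargs):
            # The protection is applied here, not in the caller's code.
            with self._lock:
                return attr(*args, **kwargs)

        return guarded


class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        # Read-modify-write: unsafe if two threads interleave here.
        current = self.value
        self.value = current + 1


# The programmer only declares that the counter is protected.
counter = Protected(Counter())


def worker():
    for _ in range(10_000):
        counter.increment()


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 40000: no increments are lost
```

Without the wrapper, the read-modify-write in `increment` is exactly the kind of unexpected interleaving described above as a data race, and some increments could be lost.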
Written by Judy Tolliver,
Aug 2, 2006
--
Last Modified August 14 2006 12:28:22.