Saturday, January 8, 2011
Parallelism and locality are the key application characteristics exploited by computer architects to make productive use of increasing transistor counts while coping with wire delay and power dissipation. Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that extract parallelism or that support arbitrary global communication. The large area and power overheads are justified by the demand for even small improvements in performance on legacy codes for popular ISAs. Many important applications have abundant parallelism, however, with dependencies and communication patterns that can be statically determined. ISAs that expose more parallelism reduce the need for area and power intensive structures to extract dependencies dynamically. Similarly, ISAs that allow locality to be expressed reduce the need for long range communication and complex Interconnect.
The challenge is to develop an efficient encoding of an application’s parallel dependency graph and to reduce the area and power consumption of the micro architecture that will execute this dependency graph. All these challenges are met by unifying the vector and multithreaded execution models with the vector-thread (VT) architectural paradigm. VT allows large amounts of structured parallelism to be compactly encoded in a form that allows a simple micro architecture to attain high performance at low power by avoiding complex control and datapath structures and by reducing activity on long wires.
The VT programmer’s model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs. In this way, the VT architecture supports a
modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved as most operand communication is isolated to within an individual VP.
SCALE, a prototype processor, is an instantiation of the vector-thread architecture designed for low-power and high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cellphones that run multitasking networked operating systems with real time video, three-dimensional graphics, and dynamic compilation of garbage collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. Benchmarks taken from these embedded domains can be mapped efficiently to the SCALE vectorthread architecture. In many cases, the codes exploit multiple types of parallelism simultaneously for greater efficiency
Posted by Josephin Joshy at 10:39 AM