Saturday, January 8, 2011

VT Architecture

Parallelism and locality are the key application characteristics exploited by computer architects to make productive use of increasing transistor counts while coping with wire delay and power dissipation. Conventional sequential ISAs provide minimal support for encoding parallelism or locality, so high-performance implementations are forced to devote considerable area and power to on-chip structures that extract parallelism or that support arbitrary global communication. These large area and power overheads are justified by the demand for even small improvements in performance on legacy codes for popular ISAs. Many important applications have abundant parallelism, however, with dependencies and communication patterns that can be determined statically. ISAs that expose more parallelism reduce the need for area- and power-intensive structures that extract dependencies dynamically. Similarly, ISAs that allow locality to be expressed reduce the need for long-range communication and complex interconnect.

The challenge is to develop an efficient encoding of an application's parallel dependency graph and to reduce the area and power consumption of the microarchitecture that executes this dependency graph. These challenges are met by unifying the vector and multithreaded execution models in the vector-thread (VT) architectural paradigm. VT allows large amounts of structured parallelism to be compactly encoded in a form that lets a simple microarchitecture attain high performance at low power, by avoiding complex control and datapath structures and by reducing activity on long wires.
The VT programmer's model extends a conventional scalar control processor with an array of slave virtual processors (VPs). VPs execute strings of RISC-like instructions packaged into atomic instruction blocks (AIBs). To execute data-parallel code, the control processor broadcasts AIBs to all the slave VPs. To execute thread-parallel code, each VP directs its own control flow by fetching its own AIBs. Implementations of the VT architecture can also exploit instruction-level parallelism within AIBs. In this way, the VT architecture supports a modeless intermingling of all forms of application parallelism. This flexibility provides new ways to parallelize codes that are difficult to vectorize or that incur excessive synchronization costs when threaded. Instruction locality is improved by allowing common code to be factored out and executed only once on the control processor, and by executing the same AIB multiple times on each VP in turn. Data locality is improved because most operand communication is isolated within an individual VP.
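The two fetch mechanisms described above can be illustrated with a toy simulation. This is only a behavioral sketch of the VT model, not the SCALE ISA; the names (`VirtualProcessor`, `vector_fetch`, `thread_fetch`) are illustrative assumptions, and each "instruction" is modeled as a simple function over a VP's local operand.

```python
class VirtualProcessor:
    """A slave VP that executes atomic instruction blocks (AIBs)."""
    def __init__(self, vp_id, data):
        self.vp_id = vp_id
        self.reg = data          # per-VP operand: data stays local to the VP

    def execute_aib(self, aib):
        # An AIB is a string of RISC-like operations, executed atomically.
        for instr in aib:
            self.reg = instr(self.reg)

class ControlProcessor:
    """Scalar control processor commanding an array of slave VPs."""
    def __init__(self, data):
        self.vps = [VirtualProcessor(i, d) for i, d in enumerate(data)]

    def vector_fetch(self, aib):
        # Data-parallel mode: broadcast one AIB to every VP.
        for vp in self.vps:
            vp.execute_aib(aib)

    def thread_fetch(self, choose_aib):
        # Thread-parallel mode: each VP directs its own control flow,
        # selecting its next AIB based on its own local state.
        for vp in self.vps:
            vp.execute_aib(choose_aib(vp.reg))

# Data-parallel step: the same AIB (scale by 10) is broadcast to all VPs.
cp = ControlProcessor([1, 2, 3, 4])
cp.vector_fetch([lambda x: x * 10])

# Thread-parallel step: each VP picks its own AIB from its own data.
cp.thread_fetch(lambda x: [lambda v: v + 1] if x < 25 else [lambda v: v - 1])
print([vp.reg for vp in cp.vps])   # [11, 21, 29, 39]
```

The sketch shows why the intermingling is "modeless": both fetch styles operate on the same VP array and the same local registers, so a program can alternate between them step by step without switching execution contexts.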

SCALE, a prototype processor, is an instantiation of the vector-thread architecture designed for low-power, high-performance embedded systems. As transistors have become cheaper and faster, embedded applications have evolved from simple control functions to cellphones that run multitasking networked operating systems with real-time video, three-dimensional graphics, and dynamic compilation of garbage-collected languages. Many other embedded applications require sophisticated high-performance information processing, including streaming media devices, network routers, and wireless base stations. Benchmarks taken from these embedded domains can be mapped efficiently to the SCALE vector-thread architecture. In many cases, the codes exploit multiple types of parallelism simultaneously for greater efficiency.
