Reference no: EM1369415
In area, time, and power alike, computation is cheap compared with data movement. For this reason, we must draw more operands from on-chip memory. Going forward, we must also draw more operands from _local_ registers, i.e., registers that serve small regions of the chip, called "cores", each with its own functional units.
As integration advances, feature sizes shrink, providing many more logic transistors, but wire delays fail to keep up. If current trends continue, a conventional cache-based processor may spend all its time moving data back and forth between a deeply pipelined processing unit and an extremely large data cache, and it will become difficult to sustain single-cycle access to that cache.
One idea is to communicate directly from one core to another nearby core, using the set of cores as if they were a "reconfigurable pipeline" under software control.
Consider a multicore architecture consisting of 16 (4 × 4) processor cores, where each core is physically square. The cores are connected by a rectangular mesh interconnect (there is a "pin" on the N, S, E, and W side of each core, and short wires connect pins).
The compiler generates code for a medium-sized loop with body B. By good fortune, B can be decomposed into pieces B_j, 1 <= j <= 14. Each B_j takes one input, performs heavy computation using the local (per-core) register set, and produces one output. Moreover, there is a producer-consumer relationship between each B_j (the producer) and B_{j+1} (the consumer).
a) How would you have the compiler assign the 14 subcomputations to the 16 cores? Explain briefly. (By symmetry, there is no unique answer.)
b) Identify two kinds of locality in this situation: show instances of both i) data-reuse locality and ii) proximity locality, and explain each.
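One natural assignment (a sketch, not the only answer) is a serpentine ("snake") path through the 4 × 4 mesh, so that every producer B_j sits on a core adjacent to its consumer B_{j+1}. The Python below is an illustrative sketch under assumed (row, col) grid coordinates; the function name and coordinate scheme are inventions for illustration, not part of the original problem.

```python
# Sketch: serpentine assignment of pipeline stages B_1..B_14 to a 4x4 mesh
# so that each producer B_j is on a core adjacent to its consumer B_{j+1}.
# (row, col) coordinates for cores are an assumption for illustration.

def serpentine_assignment(n_stages=14, rows=4, cols=4):
    coords = []
    for r in range(rows):
        # Even rows go left-to-right, odd rows right-to-left,
        # so the path never jumps between non-adjacent cores.
        cs = range(cols) if r % 2 == 0 else reversed(range(cols))
        coords.extend((r, c) for c in cs)
    # Map stage j (1-based) to the j-th core on the snake path;
    # the remaining cores are left unused.
    return {j + 1: coords[j] for j in range(n_stages)}

assignment = serpentine_assignment()

# Check: every producer/consumer pair is at Manhattan distance 1 on the mesh,
# so each output crosses exactly one short wire between neighboring pins.
for j in range(1, 14):
    (r1, c1), (r2, c2) = assignment[j], assignment[j + 1]
    assert abs(r1 - r2) + abs(c1 - c2) == 1
```

This layout makes the two localities concrete: data-reuse locality appears inside each B_j (heavy calculation on values held in the local register set), and proximity locality appears between consecutive stages (each B_j's single output travels only one mesh hop to B_{j+1}).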