Energy efficiency has become an important performance metric in recent years. The scale of applications and problems is growing rapidly every year, producing an enormous amount of data to be processed, which translates to high power and energy consumption. The ability to implement applications in an energy-efficient way is therefore one of the leading problems being addressed in research and industry.
Conventional processors such as CPUs consume a large amount of energy and cannot be customized to suit the target application. GPUs, on the other hand, are programmable but consume even more energy. FPGAs offer a middle ground among these platforms, combining high programmability with energy efficiency without sacrificing application throughput.
Our group focuses on algorithmic innovations and optimizations at the architecture-algorithm level of abstraction for the emerging landscape of FPGA-based acceleration, targeting performance metrics such as latency, throughput, and energy efficiency.
Various technologies are emerging in the field of computing. On the processor side, heterogeneous platforms that pair a general-purpose CPU with an accelerator are touted as the future of computing. On the memory side, 3D memory is a proposed solution for breaking the infamous DRAM memory wall, whereas XPoint is expected to narrow the bandwidth gap between flash memory and DRAM. Next-generation architectures will combine several of these technologies. It is important to analyze their performance in order to identify the range and scale of target applications suited to them.
We create high-level performance models that abstract these future architectures and hide the intricacies of the underlying hardware from the algorithm designer. The models help identify performance bottlenecks, which in turn guides the development of algorithmic optimizations. Designers can also run quick simulations to analyze the performance of an algorithm-architecture mapping. Optimizations at the algorithmic level have a much higher impact on performance than optimizations at the microarchitecture or circuit level. Our models allow the designer to analyze tradeoffs between various performance parameters. Further, our optimizations can be applied on top of optimizations developed at the microarchitecture or circuit level to achieve higher performance.
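As a rough illustration of what a high-level performance model can look like, the sketch below uses a simple roofline-style bound: execution time is the larger of the compute time and the external memory transfer time, and energy is average power multiplied by time. This is not our actual model. The class names `Platform` and `Kernel`, the function `estimate`, and all parameter values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    """Hypothetical abstraction of a target architecture."""
    peak_ops_per_s: float   # peak compute throughput (operations/second)
    mem_bytes_per_s: float  # sustained external memory bandwidth (bytes/second)
    power_w: float          # assumed average power draw (watts)


@dataclass
class Kernel:
    """Hypothetical abstraction of an algorithm mapped to the platform."""
    ops: float          # total operations performed
    bytes_moved: float  # total bytes moved to or from external memory


def estimate(platform: Platform, kernel: Kernel) -> dict:
    """Estimate time, energy, and the limiting resource.

    Assumes compute and memory transfers overlap perfectly, so the
    slower of the two determines the execution time.
    """
    compute_s = kernel.ops / platform.peak_ops_per_s
    memory_s = kernel.bytes_moved / platform.mem_bytes_per_s
    time_s = max(compute_s, memory_s)
    energy_j = platform.power_w * time_s
    return {
        "time_s": time_s,
        "energy_j": energy_j,
        "bottleneck": "compute" if compute_s >= memory_s else "memory",
        "ops_per_joule": kernel.ops / energy_j,
    }


if __name__ == "__main__":
    # Illustrative numbers only; they do not describe a real device.
    fpga = Platform(peak_ops_per_s=200e9, mem_bytes_per_s=12.8e9, power_w=20.0)
    fft_like = Kernel(ops=5e9, bytes_moved=1e9)
    print(estimate(fpga, fft_like))
```

Even a model this coarse shows whether a mapping is compute-bound or memory-bound, which is the kind of question a designer would want answered before investing in lower-level optimization.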
Permutation is a common operation in many signal processing algorithms implemented on FPGAs. Streaming architectures, with their high data rate and simple control, have become popular for hardware implementation of data-intensive applications. A key problem in designing such architectures is permuting streaming data. Permuting a long data sequence purely through hardware wiring leads to high area consumption and routing complexity. A more scalable approach is to build a hardware structure that permutes streamed data inputs. We have developed a resource-efficient mapping method based on the classic Benes and Clos networks to generate optimal implementations for applications such as sorting and arbitrary permutation generation. Our designs use fewer resources than the state of the art and support streaming data processing, thus achieving higher energy efficiency while sustaining the same throughput.
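To give a sense of why multistage networks scale better than direct wiring, the sketch below compares the switch count of a textbook Benes network with that of a full crossbar. It relies only on the standard result that an N-input Benes network (N a power of two) has 2·log2(N) − 1 stages of N/2 two-by-two switches each. It does not reproduce our mapping method, and the function name `benes_resources` is hypothetical.

```python
def benes_resources(n: int) -> dict:
    """Count the switching elements of an n-input Benes network.

    n must be a power of two and at least 2. The crossbar figure is
    included only as a reference point; a 2x2 switch and a crossbar
    crosspoint are not identical in hardware cost.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    log2_n = n.bit_length() - 1           # exact log2 for powers of two
    stages = 2 * log2_n - 1               # stages of 2x2 switches
    switches = stages * (n // 2)          # n/2 switches per stage
    return {
        "stages": stages,
        "switches_2x2": switches,
        "crossbar_crosspoints": n * n,    # full n-by-n crossbar
    }


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, benes_resources(n))
```

The Benes count grows as N·log N while the crossbar grows as N², which is the scaling argument behind building permutation hardware from multistage networks.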
As applications and problem sizes grow, memory power becomes a significant component of overall power. The access pattern of an application determines the power dissipated by external memory such as DDR3 and 3D memory. Further, the achievable memory bandwidth, and hence the overall throughput of the application, depends on both the access pattern of the application and the data layout in memory.
We are exploring memory power optimization in order to improve the energy efficiency of the entire system. We formulate optimal data layouts for external memory based on the target application. These layouts minimize the number of random external memory accesses and enable an efficient memory activation schedule that reduces memory power. We develop parallel architectures on FPGA that saturate the external memory bandwidth and run at a high clock rate, thereby achieving high throughput. Further, for designs that consume a large amount of on-chip memory, on-chip memory power dominates the overall FPGA power consumption. For these designs we employ memory activation scheduling for on-chip memory, activating only the necessary portion of memory to reduce energy consumption.
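The effect of data layout on external memory activity can be illustrated with a toy model. The sketch below assumes a single DRAM bank with an open-row policy, so a row activation is needed whenever an access falls in a different row than the previous one. It then counts activations for a column-wise traversal of a matrix stored in two different layouts. The function names, the single-bank assumption, and the sizes used are all hypothetical simplifications, not a description of our layouts or of any specific memory device.

```python
def count_row_activations(addresses, row_size):
    """Count row activations for an address stream.

    Assumes one bank with an open-row policy: an access to the
    currently open row is a hit; any other access activates a new row.
    Addresses and row_size are in units of matrix elements.
    """
    open_row = None
    activations = 0
    for addr in addresses:
        row = addr // row_size
        if row != open_row:
            activations += 1
            open_row = row
    return activations


def column_traversal(n_rows, n_cols, layout):
    """Yield element addresses for a column-by-column matrix traversal."""
    if layout not in ("row_major", "col_major"):
        raise ValueError("layout must be 'row_major' or 'col_major'")
    for c in range(n_cols):
        for r in range(n_rows):
            if layout == "row_major":
                yield r * n_cols + c   # consecutive accesses are n_cols apart
            else:
                yield c * n_rows + r   # consecutive accesses are adjacent


if __name__ == "__main__":
    n, row_size = 64, 64  # illustrative sizes only
    for layout in ("row_major", "col_major"):
        acts = count_row_activations(column_traversal(n, n, layout), row_size)
        print(layout, acts)
```

In this toy setting, matching the layout to the access pattern turns almost every access into a row hit, which is the intuition behind choosing the data layout per application.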
The following papers may have copyright restrictions. Downloads will have to adhere to these restrictions. They may not be reposted without explicit permission from the copyright holder.