DeHon, André
Email Address
ORCID
Disciplines
9 results
Search Results
Now showing 1 - 9 of 9
Publication An NoC Traffic Compiler for Efficient FPGA Implementation of Sparse Graph-Oriented Workloads(2011-01-01) Kapre, Nachiket; DeHon, AndréParallel graph-oriented applications expressed in the Bulk-Synchronous Parallel (BSP) and Token Dataflow compute models generate highly-structured communication workloads from messages propagating along graph edges. We can statially expose this structure to traffic compilers and optimization tools to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable NoCs. We perform load balancing, placement, fanout routing, and fine-grained synchronization to optimize our workloads for large networks up to 2025 parallel elements for BSP model and 25 parallel elements for Token Dataflow. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications in the BSP compute model. We deliver speedups of 0.5–13× (geomean 3.6×) for Sparse Direct Matrix Solve (Token Dataflow compute model) applied to a range of sparse matrices when using a high-quality placement algorithm. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.Publication The Case for Reconfigurable Components with Logic Scrubbing: Regular Hygiene Keeps Logic FIT (low)(2008-09-29) DeHon, AndréAs we approach atomic-scale logic, we must accommodate an increased rate of manufacturing defects, transient upsets, and in-field persistent failures. High defect rates demand reconfiguration to avoid defective components, and transient upsets demand online error detection to catch failures. Combining these techniques we can detect in-field persistent failures when they occur and reconfigure around them. However, since failures may be logically masked for long periods of time, persistent failures may accumulate silently; this integration of errors over time means the effective failure rate for persistent errors can exceed transient upset rates. As a result, logic scrubbing is necessary to prevent the silent accumulation of an undetectable number of persistent errors. We provide simple analysis to illustrate quantitatively how this phenomena can be a concern.Publication Fault Tolerant Sublithographic Design with Rollback Recovery(2008-03-19) Naiemi, Helia; DeHon, AndréShrinking feature sizes and energy levels coupled with high clock rates and decreasing node capacitance lead us into a regime where transient errors in logic cannot be ignored. Consequently, several recent studies have focused on feed-forward spatial redundancy techniques to combat these high transient fault rates. To complement these studies, we analyze fine-grained rollback techniques and show that they can offer lower spatial redundancy factors with no significant impact on system performance for fault rates up to one fault per device per ten million cycles of operation (Pƒ = 10-7) in systems with 1012 susceptible devices. Further, we concretely demonstrate these claims on nanowire-based programmable logic arrays. Despite expensive rollback buffers and general-purpose, conservative analysis, we show the area overhead factor of our technique is roughly an order of magnitude lower than a gate level feed-forward redundancy scheme.Publication Introduction to the Special Section on Nano Systems and Computing(2007-02-01) DeHon, André; Lent, Craig S; Lombardi, FabrizioIt is with great pleasure that we introduce the special section on Nano Systems and Computing to the readership of the IEEE Transactions on Computers. This special section consists of five papers that have been selected to cover a wide spectrum of techniques which are encountered in the design of nano-scale computing systems.Publication Pipelining Saturated Accumulation(2005-12-11) Papadantonakis, Karl; Kapre, Nachiket; Chan, Stephanie; DeHon, AndréAggressive pipelining allows FPGAs to achieve high throughput on many digital signal processing applications. However, cyclic data dependencies in the computation can limit pipelining and reduce the efficiency and speed of an FPGA implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280MHz on a Xilinx Spartan-3 (XC3S-5000-4), the maximum frequency supported by the component's DCM.Publication High Performance, Point-to-Point, Transmission Line Signaling(1998) DeHon, André; Knight, Thomas F.Inter-chip signaling latency and bandwidth can be key factors limiting the performance of large VLSI systems. We present a high performance, transmission line signaling scheme for point-to-point communications between VLSI components. In particular, we detail circuitry which allows a pad driver to sense the voltage level on the attached pad during signaling and adjust the drive impedance to match the external transmission line impedance. This allows clean, reflection-free signaling despite the wide range of variations common in IC device processing and interconnect fabrication. Further, we show how similar techniques can be used to adjust the arrival time of signals to allow high signaling bandwidth despite variations in interconnect delays. This scheme employed for high performance signaling is a specific embodiment of a more general technique. Conventional electronic systems must accommodate a range of system characteristics (e.g. delay, voltage, impedance). As a result, circuit designers traditionally build large operating margins into their circuits to guarantee proper operation across all possible ranges of these characteristics. These margins are generally added at the expense of performance. The alternative scheme exemplified here is to sample these system characteristics in the device's final operating environment and use this feedback to tune system operation around the observed characteristics. This tuning operation reduces the range of characteristics the system must accommodate, allowing increased performance. We briefly contrast this sampled, system-level feedback with the more conventional, fine-grained feedback employed on ICs (e.g. PLLs).Publication Special Issue on the 2006 International Symposium on Field-Programmable Gate Arrays(2007-02-01) Bazargan, Kia; DeHon, AndréPublication High-Reliability Computing For The Smarter Planet(2011-06-01) Quinn, Heather; Manuzzato, Andrea; Graham, Paul; DeHon, André; Carter, NicholasAs computer automation continues to increase in our society, the need for greater radiation reliability is necessary. Already critical infrastructure is failing too frequently. In this paper, we will introduce the Cross-Layer Reliability concept for designing more reliable computer systems.Publication Optimistic Parallelization of Floating-Point Accumulation(2007-06-25) Kapre, Nachiket; DeHon, AndréFloating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are mostly associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks.