Roth, Amir

Search Results

Now showing 1 - 10 of 16
  • Publication
    RENO: A Rename-Based Instruction Optimizer
    (2004-12-09) Petric, Vlad; Sha, Tingting; Roth, Amir
The effectiveness of static code optimizations--including static optimizations performed "just-in-time"--is limited by some basic constraints: (i) a limited number of logical registers, (ii) a function- or region-bounded optimization scope, and (iii) the requirement that transformations be valid along all possible paths. RENO is a modified MIPS-R10000 style register renaming mechanism augmented with physical register reference counting that uses map-table "short-circuiting" to implement dynamic versions of several well-known static optimizations: move elimination, common subexpression elimination, register allocation, and constant folding. Because it implements these optimizations dynamically, RENO can overcome some of the limitations faced by static compilers and apply optimizations where static compilers cannot. RENO has many more registers at its disposal--the entire physical register file. Its optimizations naturally cross function or any other compilation region boundary. And RENO performs optimizations along the dynamic path without being impacted by other, non-taken paths. If the dynamic path proves incorrect due to misspeculation, RENO optimizations are naturally rolled back along with the code they optimize. RENO unifies several previously proposed optimizations: dynamic move elimination [14] (RENOME), register integration [24] (RENOCSE), and speculative memory bypassing (the dynamic counterpart of register allocation) [14, 21, 22, 24] (RENORA). To this union, we add a new optimization: RENOCF, a dynamic version of constant folding. RENOCF extends the map-table entry format from [logical register -> physical register] to [logical register -> physical register + displacement]. RENOCF uses this extended map-table format to eliminate register-immediate additions--which account for a surprisingly high fraction of the dynamic instructions in SPECint and MediaBench programs--and fuse them to dependent instructions.
The most common fusion scenario is the fusion of a register-immediate addition to another addition, e.g., a memory address calculation. RENOCF implements this fusion essentially "for free" using 3-input adders. The RENO mechanism works solely with physical register names and immediate values; it does not read or write the physical register file or use any non-immediate values for any purpose. This isolated structure allows us to implement RENO within a two-stage renaming pipeline. Cycle-level simulation shows that RENO can dynamically eliminate or fold 22% of the dynamic instructions in both SPECint2000 and MediaBench; RENOCF itself is responsible for 12% and 16%, respectively. Because dataflow dependences are collapsed around eliminated instructions, RENO improves performance by averages of 8% and 13%. Alternatively, because eliminated instructions do not consume issue queue entries, physical registers, or issue, bypass, register file, and execution bandwidth, RENO can be used to absorb the performance impact of a significantly scaled-down execution core.
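The extended map-table idea can be sketched in a few lines. This is an illustrative software model, not the hardware: the names `MapTable` and `rename_addi` are invented here, and the sketch only shows how renaming a register-immediate addition can fold the immediate into the destination's map entry instead of dispatching the instruction.

```python
# Sketch of a RENO-CF-style map table: each logical register maps to a
# (physical register, displacement) pair rather than a bare physical register.
# All names here are invented for illustration.

class MapTable:
    def __init__(self):
        self.map = {}          # logical reg -> (physical reg, displacement)
        self.next_phys = 0

    def lookup(self, logical):
        return self.map.get(logical, (None, 0))

    def alloc(self, logical):
        # Conventional rename: fresh physical register, zero displacement.
        self.next_phys += 1
        self.map[logical] = (self.next_phys, 0)
        return self.next_phys

def rename_addi(mt, dst, src, imm):
    """Rename 'dst = src + imm'. Instead of dispatching the addition,
    fold the immediate into dst's map entry (dynamic constant folding)."""
    phys, disp = mt.lookup(src)
    if phys is None:
        phys = mt.alloc(src)
        disp = 0
    mt.map[dst] = (phys, disp + imm)   # the addition is eliminated
    return mt.map[dst]
```

For example, renaming `addi r2, r1, 8` and then `addi r3, r2, 4` leaves r3 mapped to r1's physical register with displacement 12; a dependent memory access can then consume the base register and displacement together through a 3-input adder.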
  • Publication
    A Quantitative Framework for Automated Pre-Execution Thread Selection
    (2002-01-01) Roth, Amir; Sohi, Gurindar S.
Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads called p-threads whenever the processor anticipates an upcoming miss. P-thread selection is the task of deciding what computations should execute on p-threads and when they should be launched such that total execution time is minimized. P-thread selection is central to the success of pre-execution. We introduce a framework for automated static p-thread selection, a static p-thread being one whose dynamic instances are repeatedly launched during the course of program execution. Our approach is to formalize the problem quantitatively and then apply standard techniques to solve it analytically. The framework has two novel components. The slice tree is a new data structure that compactly represents the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program statistics and computation structure to assign each candidate static p-thread a numeric score based on estimated latency tolerance and overhead aggregated over its expected dynamic executions. Our framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is simple and intuitively parameterized to model the salient microarchitecture features. We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed simulation, we study the effectiveness of our framework, and pre-execution in general, under different conditions. We measure the effect of constraining p-thread length, of adding localized optimization to p-threads, and of using various program samples as a statistical basis for p-thread selection, and show that our framework responds to these changes in an intuitive way.
In the microarchitecture dimension, we measure the effect of varying memory latency and processor width and observe that our framework adapts well to these changes. Each experiment includes a validation component which checks that the formal model presented to our framework correctly represents actual execution.
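The scoring idea behind aggregate advantage can be illustrated with a toy model. This is a hypothetical sketch, not the paper's actual formula: the function names and the exact arithmetic (per-launch benefit minus per-launch overhead, scaled by expected launch count) are simplifications invented for this example.

```python
# Hypothetical sketch of aggregate-advantage-style scoring. The real
# framework derives these quantities from program statistics and slice
# structure; here they are given directly.

def aggregate_advantage(latency_tolerance, overhead, dynamic_launches):
    """Score a candidate static p-thread: estimated per-launch benefit
    minus per-launch cost, aggregated over expected dynamic executions."""
    return (latency_tolerance - overhead) * dynamic_launches

def select_pthreads(candidates):
    """Keep candidates with positive aggregate advantage; if candidates
    are independent, this maximizes the summed advantage."""
    return [c for c in candidates
            if aggregate_advantage(c["tol"], c["ovh"], c["n"]) > 0]
```

The toy selection step also shows why aggregation matters: a p-thread with small per-launch benefit but huge overhead loses even when launched very often, while a high-benefit p-thread wins over fewer launches.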
  • Publication
    Encoding Mini-Graphs With Handle Prefix Outlining
    (2008-01-01) Bracy, Anne W.; Roth, Amir
    Recently proposed techniques like mini-graphs, CCA-subgraphs, and static strands exploit application-specific compound or fused instructions to reduce execution time, energy consumption, and/or processor complexity. To achieve their full potential, these techniques rely on static tools to identify common instruction sequences that make good fusion candidates. As a result, they also rely on ISA extension facilities that can encode these chosen instruction groups in a way that supports efficient execution on fusion-enabled hardware as well as compatibility across different implementations, including fusion-agnostic implementations. This paper describes handle prefix outlining, the ISA extension scheme used by mini-graph processors. Handle prefix outlining can be thought of as a hybrid of the encoding scheme used by three previous instruction aggregation techniques: PRISC, static strands, and CCA-subgraphs. It combines the best features of each scheme to deliver both full compatibility and execution efficiency on fusion-enabled processors.
  • Publication
    A High-Bandwidth Load-Store Unit for Single- and Multi-Threaded Processors
    (2004-01-01) Roth, Amir
A store queue (SQ) is a critical component of the load execution machinery. High ILP processors require high load execution bandwidth, but providing high bandwidth SQ access is difficult. Address banking, which works well for caches, conflicts with the age ordering required by the SQ, and multi-porting exacerbates the latency of the associative searches that load execution requires. In this paper, we present a new high-bandwidth load-store unit design that exploits the predictability of forwarding behavior. To start with, a simple predictor filters loads that are not likely to require forwarding from accessing the SQ, enabling a reduction in the number of associative ports. A subset of the loads that do not access the SQ are re-executed prior to retirement to detect over-aggressive filtering and train the predictor. A novel adaptation of a Bloom filter keeps the re-execution subset minimal. Next, the same predictor filters stores that do not forward values to nearby loads from the SQ, enabling a substantial capacity reduction. To enable this optimization and maintain in-order store retirement, we add a second SQ that contains all stores but is used only for retirement and Bloom filter management; this queue is large but is not associatively searched. Finally, to boost both load and store filtering and to handle programs with heavy forwarding bandwidth requirements, we add a second, address-banked forwarding structure that handles "easy" forwarding instances, leaving the globally-ordered SQ to handle only "tricky" cases. Our design does not directly address load queue scalability, but does dovetail with a recent proposal that also uses re-execution to tackle this issue. Performance simulations on SPEC2000 and MediaBench benchmarks show that our design comes within 2% (7% in the worst case) of the performance of an ideal multi-ported SQ, using only a 16-entry queue with a single associative lookup port.
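The filtering loop described above can be sketched abstractly. This is an illustrative model only: the class and method names are invented, the predictor is reduced to a set of load PCs, and the Bloom filter is a single bit-vector of recent store addresses that decides which filtered loads must re-execute.

```python
# Illustrative sketch of forwarding-prediction-based SQ filtering.
# Loads predicted not to forward skip the SQ; a Bloom filter of recent
# store addresses selects the few filtered loads worth re-executing.

class ForwardingFilter:
    def __init__(self, bloom_bits=64):
        self.predict_forward = set()   # load PCs predicted to need the SQ
        self.bloom = 0                 # bit-vector of recent store addresses
        self.bloom_bits = bloom_bits

    def note_store(self, addr):
        self.bloom |= 1 << (hash(addr) % self.bloom_bits)

    def load_needs_sq(self, pc):
        return pc in self.predict_forward

    def load_needs_reexec(self, pc, addr):
        # A filtered load re-executes only if a recent store may alias it.
        if self.load_needs_sq(pc):
            return False
        return bool(self.bloom & (1 << (hash(addr) % self.bloom_bits)))

    def train(self, pc):
        # Re-execution caught a missed forwarding: access the SQ next time.
        self.predict_forward.add(pc)
```

The key property the sketch captures is that mispredictions are detected by re-execution rather than by an associative search, so the common-case load path stays narrow.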
  • Publication
    Three Extensions to Register Integration
    (2002-01-01) Roth, Amir; Bracy, Anne; Petric, Vlad
Register integration (or just integration) is a register renaming discipline that implements instruction reuse via physical register sharing. Initially developed to perform squash reuse, the integration mechanism is a powerful reuse tool that can exploit more reuse scenarios. In this paper, we describe three extensions to the initial integration mechanism that expand its applicability and boost its performance impact. First, we extend squash reuse to general reuse. Whereas squash reuse maintains the superscalar concept of an instruction instance "owning" its output physical register, we allow multiple instructions to simultaneously and seamlessly share a single physical register. Next, we replace the PC-indexing scheme used by squash reuse with an opcode-based indexing scheme that exposes more integration opportunities. Finally, we introduce an extension called reverse integration, in which we speculatively create integration entries for the inverses of operations - for instance, when renaming an add, we create an entry for the inverse subtract. Reverse integration allows us to reuse operations that were not specified by the original program. We use reverse integration to obtain a free implementation of speculative memory bypassing for stack-pointer based loads (register fills and restores). Our evaluation shows that these extensions increase the integration rate - the number of retired instructions that integrate older results and bypass the execution engine - to an average of 17% on the SPEC2000 integer benchmarks. On a 4-way superscalar processor with an aggressive memory system, this translates into an average IPC improvement of 8%. The fact that integrating instructions completely bypass the execution engine raises the possibility of using integration as a low-complexity substitute for execution bandwidth and issue buffering. Our experiments show that such a trade-off is possible, enabling a range of IPC/complexity designs.
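Opcode-based indexing and reverse integration can be sketched together. This is an illustrative model with invented names: the table maps (opcode, input physical registers) to a result physical register, and inserting an add also plants an entry for the inverse subtract so a later subtract can reuse a register that was never explicitly computed.

```python
# Sketch of an opcode-indexed integration table with reverse integration.
# A matching later instruction "integrates": it shares the recorded output
# physical register instead of executing. Names are invented here.

class IntegrationTable:
    def __init__(self):
        self.table = {}   # (opcode, input physical regs) -> output physical reg

    def insert(self, op, in_phys, out_phys):
        self.table[(op, in_phys)] = out_phys
        # Reverse integration: renaming 'out = a + b' also creates an
        # entry for the inverse 'a = out - b', an operation the program
        # never specified but may later want.
        if op == "add":
            src_a, src_b = in_phys
            self.table[("sub", (out_phys, src_b))] = src_a

    def integrate(self, op, in_phys):
        """Return a shareable physical register, or None (execute normally)."""
        return self.table.get((op, in_phys))
```

In this toy form, a subtract whose inputs match a previously renamed add's output and second input integrates the add's first input for free.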
  • Publication
    Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
    (2004-12-01) Bracy, Anne; Prahlad, Prashant; Roth, Amir
    A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor’s stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture is kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2–12% over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth.
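The singleton-instruction interface described above amounts to a checkable set of constraints on a candidate dataflow graph. The sketch below is a simplification with invented names: it checks external register inputs, memory-operation count, and the terminal-control-transfer rule, and omits the single-live-output check, which would need liveness information.

```python
# Illustrative check of the mini-graph interface constraints: at most two
# register inputs, at most one memory operation, and at most one control
# transfer, which must be terminal. Names and encoding are invented.

def is_valid_minigraph(insns):
    """insns: list of dicts with 'srcs', 'dst', 'is_mem', 'is_branch'."""
    dsts = {i["dst"] for i in insns if i["dst"] is not None}
    # External register inputs: sources not produced inside the graph.
    ext_inputs = {s for i in insns for s in i["srcs"]} - dsts
    mem_ops = sum(i["is_mem"] for i in insns)
    branches = [k for k, i in enumerate(insns) if i["is_branch"]]
    # The single-register-output constraint needs liveness info (omitted).
    return (len(ext_inputs) <= 2
            and mem_ops <= 1
            and (not branches or branches == [len(insns) - 1]))
```

A selection algorithm over basic-block profiles would apply a check like this to each candidate subgraph before weighing it by execution frequency.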
  • Publication
    Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization
    (2004-01-01) Roth, Amir
A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations. Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends for cache bandwidth with store retirement, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique. Store Vulnerability Window (SVW) is a new mechanism that reduces the re-execution requirements of a given load/store optimization significantly, by an average of 85% across the three load/store optimizations we study.
This reduction relieves cache port contention and removes many of the dynamic serialization events that contribute the bulk of re-execution's cost, allowing these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all; without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
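The combination of monotonic sequence numbering and Bloom-style filtering can be sketched compactly. This is a simplified illustration with invented names: committed stores receive increasing sequence numbers (SSNs), a small hash-indexed table records the SSN of the last store to each bucket, and a load re-executes only if a store it is still vulnerable to may have written its address.

```python
# Illustrative SVW-style sketch: a load is "vulnerable" to stores younger
# than some known-safe SSN; the hash-indexed table conservatively detects
# whether any such store may have touched the load's address.

class SVWFilter:
    def __init__(self, buckets=256):
        self.ssn = 0                    # monotonic store sequence number
        self.last_ssn = [0] * buckets   # per-bucket SSN of last store

    def commit_store(self, addr):
        self.ssn += 1
        self.last_ssn[hash(addr) % len(self.last_ssn)] = self.ssn

    def must_reexecute(self, addr, safe_ssn):
        """safe_ssn: the SSN the load is known safe against (e.g., the
        store it forwarded from). Re-execute only if a younger store may
        have written this address; hash conflicts err conservatively."""
        return self.last_ssn[hash(addr) % len(self.last_ssn)] > safe_ssn
```

Because the table only ever over-approximates, a negative answer is always safe to trust, which is what lets it filter the bulk of re-executions.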
  • Publication
    Scalable Store-Load Forwarding via Store Queue Index Prediction
    (2005-11-01) Martin, Milo; Sha, Tingting; Roth, Amir
    Conventional processors use a fully-associative store queue (SQ) to implement store-load forwarding. Associative search latency does not scale well to capacities and bandwidths required by wide-issue, large window processors. In this work, we improve SQ scalability by implementing store-load forwarding using speculative indexed access rather than associative search. Our design uses prediction to identify the single SQ entry from which each dynamic load is most likely to forward. When a load executes, it either obtains its value from the predicted SQ entry (if the address of the entry matches the load address) or the data cache (otherwise). A forwarding mis-prediction — detected by pre-commit filtered load re-execution — results in a pipeline flush. SQ index prediction is generally accurate, but for some loads it cannot reliably identify a single SQ entry. To avoid flushes on these difficult loads while keeping the single-SQ-access-per-load invariant, a second predictor delays difficult loads until all but the youngest of their "candidate" stores have committed. Our predictors are inspired by store-load dependence predictors for load scheduling (Store Sets and the Exclusive Collision Predictor) and unify load scheduling and forwarding. Experiments on the SPEC2000 and MediaBench benchmarks show that on an 8-way issue processor with a 512-entry reorder buffer, our technique performs within 3.3% of an ideal associative SQ (same latency as the data cache) and either matches or exceeds the performance of a realistic associative SQ (slower than data cache) on 31 of 47 programs.
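The single-entry access pattern above is easy to express as a sketch. This is an illustrative model with invented names: a predictor supplies at most one SQ index per load, the load checks only that entry's address, and otherwise falls through to the data cache. Misprediction detection by pre-commit filtered re-execution and the resulting pipeline flush are not modeled.

```python
# Sketch of speculative indexed store-load forwarding: one predicted SQ
# entry per load, no associative search. Names are invented here.

def execute_load(load_addr, predicted_entry, store_queue, dcache):
    """store_queue: list of (addr, value) pairs; predicted_entry: an
    index into store_queue, or None if no forwarding is predicted."""
    if predicted_entry is not None:
        st_addr, st_val = store_queue[predicted_entry]
        if st_addr == load_addr:
            return st_val          # forward from the one predicted entry
    return dcache.get(load_addr)   # fall back to the data cache
```

The point the sketch makes is structural: the SQ becomes an indexed RAM rather than a CAM, so its latency and bandwidth scale like a small cache.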
  • Publication
    NoSQ: Store-Load Communication without a Store Queue
    (2007-02-01) Martin, Milo; Sha, Tingting; Roth, Amir
    The NoSQ microarchitecture performs store-load communication without a store queue and without executing stores in the out-of-order engine. It uses speculative memory bypassing for all in-flight store-load communication, enabled by a 99.8 percent accurate store-load communication predictor. The result is a simple, fast core data path containing no dedicated store-load forwarding structures.
  • Publication
    NoSQ: Store-Load Communication without a Store Queue
    (2006-12-01) Sha, Tingting; Martin, Milo; Roth, Amir
    This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that for a given dynamic load can predict whether that load will bypass and the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design - despite being more speculative - slightly outperforms a conventional store-queue based design on most benchmarks (by 2% on average).
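The DEF-store-load-USE to DEF-USE short-circuiting at the heart of SMB can be sketched at rename time. This is an illustrative simplification with invented names: the predictor yields a store "distance" for a bypassing load, and renaming simply maps the load's output to the communicating store's data register, so the load never touches any forwarding structure.

```python
# Illustrative sketch of speculative memory bypassing during renaming:
# a predicted bypassing load takes the communicating store's data
# register directly, collapsing DEF-store-load-USE into DEF-USE.

def rename_load(load_pc, dst_logical, map_table, bypass_pred, store_data_regs):
    """bypass_pred: load PC -> distance to the communicating store, or
    absent. store_data_regs: data physical registers of in-flight
    stores, oldest first; distance 1 means the youngest store."""
    dist = bypass_pred.get(load_pc)
    if dist is not None and dist <= len(store_data_regs):
        # Bypass: the load's consumers read the store's data register.
        map_table[dst_logical] = store_data_regs[-dist]
        return "bypassed"
    return "cache"   # non-bypassed load reads the data cache as usual
```

Verification (in-order re-execution filtered by an SMB-aware SVW) sits outside this sketch; here the prediction is simply trusted, which is what makes the datapath so simple.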