A High-Bandwidth Load-Store Unit for Single- and Multi-Threaded Processors

Roth, Amir

Files

MS_CIS_04_09.pdf (182.65 KB)

Penn collection

Technical Reports (CIS)

Subject

computer science
processors

Permalink

https://repository.upenn.edu/handle/20.500.14332/6973

Abstract

A store queue (SQ) is a critical component of the load execution machinery. High-ILP processors require high load execution bandwidth, but providing high-bandwidth SQ access is difficult. Address banking, which works well for caches, conflicts with the age ordering that the SQ requires, and multi-porting exacerbates the latency of the associative searches that load execution requires. In this paper, we present a new high-bandwidth load-store unit design that exploits the predictability of forwarding behavior. First, a simple predictor filters loads that are not likely to require forwarding from accessing the SQ, enabling a reduction in the number of associative ports. A subset of the loads that do not access the SQ is re-executed prior to retirement to detect over-aggressive filtering and train the predictor. A novel adaptation of a Bloom filter keeps the re-execution subset minimal. Next, the same predictor filters stores that don't forward values to nearby loads from the SQ, enabling a substantial capacity reduction. To enable this optimization and maintain in-order store retirement, we add a second SQ that contains all stores but is used only for retirement and Bloom filter management; this queue is large but isn't associatively searched. Finally, to boost both load and store filtering and to handle programs with heavy forwarding bandwidth requirements, we add a second, address-banked forwarding structure that handles "easy" forwarding instances, leaving the globally ordered SQ to handle only "tricky" cases. Our design does not directly address load queue scalability, but it does dovetail with a recent proposal that also uses re-execution to tackle this issue. Performance simulations on SPEC2000 and MediaBench benchmarks show that our design comes within 2% (7% in the worst case) of the performance of an ideal multi-ported SQ, using only a 16-entry queue with a single associative lookup port.
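
The load-filtering idea in the abstract can be illustrated with a small functional model. This is only a sketch under stated assumptions, not the report's design: the class names, the PC-indexed predictor set, the single-hash counting filter (a stand-in for the report's Bloom filter adaptation, whose details are not given here), and the immediate re-execution check (the abstract places re-execution prior to retirement) are all hypothetical simplifications.

```python
class CountingFilter:
    """Hypothetical single-hash counting filter over in-flight store addresses.

    It can report false positives (aliasing) but never false negatives.
    """

    def __init__(self, size=64):
        self.counts = [0] * size

    def _slot(self, addr):
        return addr % len(self.counts)

    def add(self, addr):
        self.counts[self._slot(addr)] += 1

    def remove(self, addr):
        self.counts[self._slot(addr)] -= 1

    def may_contain(self, addr):
        return self.counts[self._slot(addr)] > 0


class LoadStoreUnit:
    """Toy model: loads skip the store queue unless predicted to need forwarding."""

    def __init__(self):
        self.memory = {}              # committed memory state
        self.store_queue = []         # in-flight (addr, value) pairs, oldest first
        self.forwarding_pcs = set()   # assumed predictor: load PCs seen to need forwarding
        self.inflight = CountingFilter()
        self.sq_searches = 0          # associative SQ lookups performed
        self.reexecutions = 0         # loads that had to be re-checked

    def store(self, addr, value):
        self.store_queue.append((addr, value))
        self.inflight.add(addr)

    def retire_oldest_store(self):
        addr, value = self.store_queue.pop(0)
        self.memory[addr] = value
        self.inflight.remove(addr)

    def _youngest_value(self, addr):
        # Youngest matching in-flight store wins; otherwise read memory.
        for st_addr, st_value in reversed(self.store_queue):
            if st_addr == addr:
                return st_value
        return self.memory.get(addr, 0)

    def load(self, pc, addr):
        if pc in self.forwarding_pcs:
            self.sq_searches += 1     # predicted to forward: pay for the SQ search
            return self._youngest_value(addr)
        value = self.memory.get(addr, 0)      # filtered load: bypass the SQ
        if self.inflight.may_contain(addr):   # only these loads are re-checked
            self.reexecutions += 1
            # Simplification: the check runs immediately rather than before retirement.
            correct = self._youngest_value(addr)
            if correct != value:
                self.forwarding_pcs.add(pc)   # train predictor on over-aggressive filtering
                value = correct
        return value
```

In this model, a load that never conflicts with an in-flight store neither searches the queue nor re-executes, while a load that is caught reading stale data trains the predictor and searches the queue on later executions.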

Publication date

2004-01-01

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-04-09.

Collection

Reports