Technical Reports (CIS)

Document Type

Technical Report

Date of this Version

January 2002

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-02-23.

Abstract

Pre-execution attacks cache misses for which conventional address-prediction driven prefetching is ineffective. In pre-execution, copies of cache miss computations are isolated from the main program and launched as separate threads called p-threads whenever the processor anticipates an upcoming miss. P-thread selection is the task of deciding what computations should execute on p-threads and when they should be launched such that total execution time is minimized. P-thread selection is central to the success of pre-execution.

We introduce a framework for automated static p-thread selection, a static p-thread being one whose dynamic instances are repeatedly launched during the course of program execution. Our approach is to formalize the problem quantitatively and then apply standard techniques to solve it analytically. The framework has two novel components. The slice tree is a new data structure that compactly represents the space of all possible static p-threads. Aggregate advantage is a formula that uses raw program statistics and computation structure to assign each candidate static p-thread a numeric score based on estimated latency tolerance and overhead aggregated over its expected dynamic executions. Our framework finds the set of p-threads whose aggregate advantages sum to a maximum. The framework is simple and intuitively parameterized to model the salient microarchitecture features.

We apply our framework to the task of choosing p-threads that cover L2 cache misses. Using detailed simulation, we study the effectiveness of our framework, and pre-execution in general, under difference conditions. We measure the effect of constraining p-thread length, of adding localized optimization to p-threads, and of using various program samples as a statistical basis for the p-thread selection, and show that our framework responds to these changes in an intuitive way. In the microarchitecture dimension, we measure the effect of varying memory latency and processor width and observe that our framework adapts well to these changes. Each experiment includes a validation component which checks that the formal model presented to our framework correctly represents actual execution.

Share

COinS
 

Date Posted: 20 June 2007