Abstract:
A system and method to optimize runahead operation for a processor without use of a separate explicit runahead cache structure. Rather than simply dropping store instructions in a processor runahead mode, store instructions write their results in an existing processor store queue, although store instructions are not allowed to update processor caches and system memory. Use of the store queue during runahead mode to hold store instruction results allows more recent runahead load instructions to search retired store queue entries in the store queue for matching addresses to utilize data from the retired, but still searchable, store instructions. Retired store instructions could be either runahead store instructions retired, or retired store instructions that executed before entering runahead mode.
Abstract:
Embodiments that dynamically reorganize data of cache lines in non-uniform cache access (NUCA) caches are contemplated. Various embodiments comprise a computing device, having one or more processors coupled with one or more NUCA cache elements. The NUCA cache elements may comprise one or more banks of cache memory, wherein ways of the cache are horizontally distributed across multiple banks. To improve access latency of the data by the processors, the computing devices may dynamically propagate cache lines into banks closer to the processors using the cache lines. To accomplish such dynamic reorganization, embodiments may maintain “direction” bits for cache lines. The direction bits may indicate to which processor the data should be moved. Further, embodiments may use the direction bits to make cache line movement decisions.
Abstract:
A method for estimating power dissipated by a processor core processing a workload-includes analyzing a reference test case to generate a reference workload characteristic, analyzing an actual workload to generate an actual workload characteristic, performing a power analysis for the reference test case to establish a reference power dissipation value and estimating an actual workload power dissipation value responsive to the actual and reference workload characteristics and the reference power dissipation value.
Abstract:
A physical document comprising a human-readable part and a machine-readable part, wherein the machine-readable part comprises markup that describes information on at least one of the document and data within the human-readable part.
Abstract:
A method for estimating power dissipated by processor core processing a workload includes analyzing a reference test case to generate a reference workload characteristic, analyzing an actual workload to generate an actual workload characteristic, performing a power analysis for the reference test case to establish a reference power dissipation value and estimating an actual workload power dissipation value responsive to the actual and reference workload characteristics and the reference power dissipation value.
Abstract:
A method, system, and apparatus for estimating the power dissipated by a processor core processing a workload, where the method includes analyzing a reference test case to generate a reference workload characteristic. Analyzing an actual workload to generate an actual workload characteristic. Performing a power analysis for the reference test case to establish a reference power dissipation value. Estimating an actual workload power dissipation value responsive to the actual and reference workload characteristics and the reference power dissipation value
Abstract:
A number of hypervisor register fields are set to specify which processor cores are allowed to generate a number of performance events for a particular thread group. A plurality of threads for an application running in the computing environment to a plurality of thread groups are configured by a plurality of thread group fields in a plurality of control registers. A number of counter sets are allowed to count a number of thread group events originating from one of a shared resource and a shared cache are specified by a number of additional hypervisor register fields.
Abstract:
Sampled instruction address registers are shared among multiple threads executing on a plurality of processor cores. Each of a plurality of sampled instruction address registers are assigned to a particular thread running for an application on the plurality of processor cores. Each of the sampled instruction address registers are configured by storing in each of the sampled instruction address registers a thread identification of the particular thread in a thread identification field and a processor identification of a particular processor on which the particular thread is running in a processor identification field.
Abstract:
A processing system includes a memory and a first core configured to process applications. The first core includes a first cache. The processing system includes a mechanism configured to capture a sequence of addresses of the application that miss the first cache in the first core and to place the sequence of addresses in a storage array; and a second core configured to process at least one software algorithm. The at least one software algorithm utilizes the sequence of addresses from the storage array to generate a sequence of prefetch addresses. The second core issues prefetch requests for the sequence of the prefetch addresses to the memory to obtain prefetched data and the prefetched data is provided to the first core if requested.
Abstract:
A system for controlling a multitasking microprocessor system includes an interconnect, a plurality of processing units connected to the interconnect forming a single-source, single-sink flow network, wherein the plurality of processing units pass data between one another from the single-source to the single-sink, and a monitor connected to the interconnect for monitoring a portion of a resource consumed by each of the plurality of processing units and for controlling the plurality of processing units according to a predetermined budget for the resource to control a data overflow condition, wherein the monitor controls performance and power modes of the plurality of processing units.