Abstract:
In one embodiment of the invention, a processor comprises an upper level cache and at least one processor core. The at least one processor core includes one or more registers and a plurality of instruction processing stages: a decode unit to decode an instruction requiring an input of a plurality of data elements, wherein a size of each of the plurality of data elements is less than a cache line size of the processor; and an execution unit to load the plurality of data elements into the one or more registers of the processor without loading either the plurality of data elements or data elements spatially adjacent to them into the upper level cache.
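The difference between a conventional gather and the cache-bypassing load described above can be sketched as a small behavioral model. This is an illustration only, not the claimed hardware: the 64-byte line, 8-byte elements, and the `CacheModel` class with its two methods are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass, field

CACHE_LINE = 64  # assumed cache line size in bytes
ELEM = 8         # assumed element size in bytes (smaller than a line)


@dataclass
class CacheModel:
    memory: dict                                   # element address -> value
    upper_cache: set = field(default_factory=set)  # resident line addresses
    register: list = field(default_factory=list)   # destination register contents

    def gather_cached(self, addrs):
        """Conventional gather: each access fills its whole line into the
        cache, pulling in spatially adjacent elements never requested."""
        for a in addrs:
            self.upper_cache.add(a - a % CACHE_LINE)
        self.register = [self.memory[a] for a in addrs]

    def gather_uncached(self, addrs):
        """Behavior sketched from the abstract: only the requested elements
        reach the register; nothing is installed in the upper level cache."""
        self.register = [self.memory[a] for a in addrs]
```

For a sparse access pattern such as `[0, 512, 1024, 2048]`, `gather_cached` installs four 64-byte lines to use only 8 bytes from each, whereas `gather_uncached` leaves the cache contents untouched.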
Abstract:
In an embodiment, a processor includes one or more cores including a first core operable at an operating voltage between a minimum operating voltage and a maximum operating voltage. The processor also includes a power control unit including first logic to enable coupling of ancillary logic to the first core responsive to the operating voltage being less than or equal to a threshold voltage, and to disable the coupling of the ancillary logic to the first core responsive to the operating voltage being greater than the threshold voltage. Other embodiments are described and claimed.
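The threshold rule in the power control unit reduces to a single comparison. The sketch below assumes a hypothetical `PowerControlUnit` class and uses illustrative voltages; it models only the coupling decision, not how the ancillary logic is physically connected.

```python
from dataclasses import dataclass


@dataclass
class PowerControlUnit:
    v_min: float        # minimum operating voltage of the first core (volts)
    v_max: float        # maximum operating voltage of the first core (volts)
    v_threshold: float  # coupling threshold (hypothetical value)
    ancillary_coupled: bool = False

    def set_operating_voltage(self, v: float) -> None:
        if not (self.v_min <= v <= self.v_max):
            raise ValueError("operating voltage outside [v_min, v_max]")
        # Couple ancillary logic at or below the threshold; decouple above it.
        self.ancillary_coupled = v <= self.v_threshold
```

With a threshold of 0.75 V, for example, setting 0.70 V or exactly 0.75 V couples the ancillary logic, while 0.90 V decouples it.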
Abstract:
Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program into multiple parallel threads are described. In some embodiments, the systems and apparatuses execute a method of original code decomposition and/or generated thread execution.
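The abstract does not say how the decomposition is performed. As one simple instance of the general idea, the sketch below splits the iteration space of a loop whose iterations are assumed to be independent into contiguous per-thread chunks and runs them concurrently. The functions `decompose` and `parallel_map` are hypothetical, and Python threads here show only the structure of generated thread execution, not a hardware speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(n_iters, n_threads):
    """Split the iteration space [0, n_iters) into contiguous chunks,
    one per thread, with sizes differing by at most one."""
    base, extra = divmod(n_iters, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def parallel_map(body, data, n_threads=4):
    """Run body(data[i]) for every i, one chunk of iterations per thread.
    Assumes iterations are independent (no cross-iteration dependences)."""
    out = [None] * len(data)

    def run(chunk):
        for i in chunk:
            out[i] = body(data[i])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(run, decompose(len(data), n_threads)))  # re-raises errors
    return out
```

Each thread writes a disjoint slice of `out`, so no locking is needed; a real decomposer would first have to prove that independence.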
Abstract:
An embodiment of a semiconductor package apparatus may include technology to identify a nested loop in a set of executable instructions, and determine at runtime if the nested loop is a candidate for cache blocking. Other embodiments are disclosed and claimed.
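As an illustration of what cache blocking does to a nested loop, the sketch pairs a hypothetical runtime test (does one pass of the inner loop touch more data than the cache holds?) with a blocked matrix transpose. The heuristic `is_blocking_candidate` and its parameters are assumptions for the example, not the criterion claimed in the abstract.

```python
def is_blocking_candidate(inner_trip, bytes_per_iter, cache_bytes):
    # Hypothetical heuristic: if one full pass of the inner loop touches more
    # data than the cache holds, reuse across outer iterations is lost.
    return inner_trip * bytes_per_iter > cache_bytes


def transpose_blocked(a, n, block):
    """Transpose an n x n row-major list using block x block tiles, so that
    both the source rows and destination columns of a tile stay cache-resident."""
    out = [0] * (n * n)
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j * n + i] = a[i * n + j]
    return out
```

The `min(...)` bounds handle matrices whose size is not a multiple of the tile size; the result is identical to an unblocked transpose, only the memory access order changes.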
Abstract:
Techniques are disclosed to identify a frequently-executed region of code during runtime execution of the code, generate initial profiling code for the frequently-executed region of code, cause the initial profiling code to be executed for a minimum number of processing cycles of the computer, and identify replacement candidate store instruction(s) that store a value that is not read by the frequently-executed region of code during execution of the initial profiling code. Replacement candidate load instruction(s) may also be identified that load a value that is not stored or loaded by the frequently-executed region of code during execution of the initial profiling code. Optimized code for the frequently-executed region of code may be generated by replacing each replacement candidate store or load instruction with a non-temporal store or load instruction. The optimized code may be executed instead of the frequently-executed region of code during subsequent runtime execution.
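The candidate-selection step can be sketched as an analysis over a memory trace recorded while the profiling code runs. The trace format `(pc, kind, addr)` and the function `find_nt_candidates` are hypothetical, and "not stored or loaded" is interpreted here as "the address is touched exactly once in the trace"; the actual profiling mechanism may differ.

```python
from collections import Counter


def find_nt_candidates(trace):
    """trace: list of (pc, kind, addr), kind in {"load", "store"}, recorded
    while profiling the hot region. Returns (store_pcs, load_pcs) that are
    candidates for replacement with non-temporal instructions."""
    loaded = {addr for _, kind, addr in trace if kind == "load"}
    touches = Counter(addr for _, _, addr in trace)
    store_ok, load_ok = {}, {}
    for pc, kind, addr in trace:
        if kind == "store":
            # A store qualifies only if no address it writes is ever read.
            store_ok[pc] = store_ok.get(pc, True) and addr not in loaded
        else:
            # A load qualifies only if each address it reads has no other use.
            load_ok[pc] = load_ok.get(pc, True) and touches[addr] == 1
    return ({pc for pc, ok in store_ok.items() if ok},
            {pc for pc, ok in load_ok.items() if ok})
```

An instruction is disqualified as soon as any one of its accesses shows reuse, which keeps the replacement conservative. On x86, for instance, a selected store could then be emitted as a non-temporal `MOVNTI`.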
Abstract:
A processor includes a processor core and a cache controller coupled to the processor core. The cache controller is to allocate, in a cache, a plurality of cache entries for a memory. The processor core is to detect an amount of the memory installed in a computing system and, responsive to detecting less than a maximum allowable amount of the memory for the computing system, direct the cache controller to increase a number of ways of the cache in which to allocate the plurality of cache entries.
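The abstract states only that fewer installed memory leads to more ways being allocated. The sketch below picks one hypothetical policy (scale the way count by the ratio of maximum to installed memory, add at least one way, and cap at the cache's associativity); `ways_for_memory` and all numbers are illustrative assumptions.

```python
def ways_for_memory(installed_gb: float, max_gb: float,
                    base_ways: int, total_ways: int) -> int:
    """Hypothetical policy: number of cache ways in which entries for the
    memory may be allocated, given how much memory is installed."""
    if installed_gb <= 0 or installed_gb > max_gb:
        raise ValueError("installed memory must be in (0, max_gb]")
    if installed_gb == max_gb:
        return base_ways  # fully populated: keep the default allocation
    # Less than the maximum installed: grow the allocation roughly in
    # proportion to max/installed, by at least one way, capped at the cache.
    scaled = round(base_ways * max_gb / installed_gb)
    return min(total_ways, max(base_ways + 1, scaled))
```

With a 16-way cache and a default of 4 ways for a 64 GB maximum, installing 32 GB yields 8 ways and installing 8 GB saturates at all 16.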