Abstract:
Disclosed are methods and apparatuses for preventing memory violations. In an aspect, a fetch unit accesses, from a branch predictor of a processor, a disambiguation indicator associated with a block of instructions of a program to be executed by the processor, and fetches, from an instruction cache, the block of instructions. The processor executes load instructions and/or store instructions in the block of instructions based on the disambiguation indicator indicating whether or not the load instructions and/or the store instructions in the block of instructions can bypass other instructions of the program or be bypassed by other instructions of the program.
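A minimal C sketch of the fetch-time lookup described above, assuming a direct-mapped predictor table and field names chosen for illustration (the disclosed design is hardware; this only models the decision):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BP_ENTRIES 256

/* Hypothetical branch-predictor entry carrying a disambiguation bit
   alongside the usual prediction state. */
typedef struct {
    uint64_t block_addr;
    bool     may_bypass;  /* loads/stores in this block may reorder */
} bp_entry_t;

static bp_entry_t predictor[BP_ENTRIES];

/* Fetch-time lookup: index the predictor with the block address and
   return the disambiguation indicator for the block. */
static bool lookup_disambiguation(uint64_t block_addr) {
    bp_entry_t *e = &predictor[block_addr % BP_ENTRIES];
    return (e->block_addr == block_addr) ? e->may_bypass : false;
}

int main(void) {
    predictor[0x40 % BP_ENTRIES] = (bp_entry_t){ 0x40, true };
    printf("block 0x40 may bypass: %d\n", lookup_disambiguation(0x40));
    return 0;
}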
Abstract:
Systems and methods for selective refresh of a cache, such as a last-level cache implemented as an embedded DRAM (eDRAM). A refresh bit and a reuse bit are associated with each way of at least one set of the cache. A least recently used (LRU) stack tracks positions of the ways, with positions towards a most recently used position of a threshold comprising more recently used positions and positions towards a least recently used position of the threshold comprising less recently used positions. A line in a way is selectively refreshed if the position of the way is one of the more recently used positions and the refresh bit associated with the way is set, or if the position of the way is one of the less recently used positions and the refresh bit and the reuse bit associated with the way are both set.
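The refresh decision reduces to a small predicate over the LRU position, the threshold, and the two bits. A minimal C sketch, with field names assumed for illustration:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-way state for one set. */
typedef struct {
    bool refresh;  /* line still holds data worth refreshing */
    bool reuse;    /* line was re-referenced since insertion */
    int  lru_pos;  /* 0 = most recently used */
} way_state_t;

/* Refresh a line if it sits in the more recently used region and its
   refresh bit is set, or if it sits in the less recently used region
   and both its refresh and reuse bits are set. */
static bool should_refresh(const way_state_t *w, int threshold) {
    if (w->lru_pos < threshold)
        return w->refresh;
    return w->refresh && w->reuse;
}

int main(void) {
    way_state_t hot  = { .refresh = true, .reuse = false, .lru_pos = 1 };
    way_state_t cold = { .refresh = true, .reuse = false, .lru_pos = 6 };
    printf("hot refreshed: %d, cold refreshed: %d\n",
           should_refresh(&hot, 4), should_refresh(&cold, 4));
    return 0;
}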
Abstract:
A proposed prefetcher may operate at a cache level where accesses are conducted using physical addresses. The proposed prefetcher may include one or more prefetch engines. Similar to conventional prefetchers, a prefetch engine of the proposed prefetcher may train on access patterns of a memory page to predict future accesses and perform prefetches based on the training. But unlike conventional prefetchers, the trained prefetch engine may be reused for prefetching even when a request for a new page is received, without requiring the prefetch engine to be newly trained on the new page. This can lower access latencies and reduce cumulative training time.
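A minimal C sketch of the reuse idea, assuming a simple stride-based engine (a real engine would track richer patterns): the learned stride carries over when a new page arrives, and only the base address is rebased.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical prefetch engine state. */
typedef struct {
    bool     trained;    /* engine has already learned a pattern */
    int64_t  stride;     /* learned inter-access stride, in bytes */
    uint64_t last_addr;  /* most recent access the engine has seen */
} pf_engine_t;

/* On a request to a new page, keep the learned stride and rebase the
   engine instead of retraining it from scratch. */
static void rebase_engine(pf_engine_t *e, uint64_t new_page_addr) {
    e->last_addr = new_page_addr;  /* stride and trained state carry over */
}

static uint64_t next_prefetch(const pf_engine_t *e) {
    return e->last_addr + (uint64_t)e->stride;
}

int main(void) {
    pf_engine_t e = { .trained = true, .stride = 64, .last_addr = 0x1000 };
    rebase_engine(&e, 0x8000);  /* new page, no retraining */
    printf("prefetch 0x%llx\n", (unsigned long long)next_prefetch(&e));
    return 0;
}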
Abstract:
An apparatus for fan out of a result of a first instruction can include first through fourth sets of memory cells and circuitry. The first set can be configured to store the result of the first instruction. The second set can be configured to store an operation code of a second instruction. The third set can be configured to store information of the second instruction. The fourth set can be configured to store an operand for the second instruction. The circuitry can be configured to connect the fourth set to an execution unit and to cause, in response to a presence of the information in the third set, the execution unit to be configured to receive a content of the first set as the operand for the second instruction. A format of the second instruction can include sets of bits designated for the operation code and for the information.
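A minimal C sketch modeling the four sets of cells as struct fields (names assumed): when the fan-out information is present in the third set, the stored result is steered into the operand seen by the execution unit.

#include <stdint.h>
#include <stdio.h>

/* The four sets of memory cells, modeled as fields. */
typedef struct {
    uint64_t result;       /* first set: result of the first instruction */
    uint8_t  opcode;       /* second set: opcode of the second instruction */
    uint8_t  fanout_info;  /* third set: nonzero => forward the result */
    uint64_t operand;      /* fourth set: operand for the second instruction */
} entry_t;

/* Operand selection: in response to the presence of the information,
   the execution unit receives the content of the first set instead. */
static uint64_t operand_for_exec(const entry_t *e) {
    return e->fanout_info ? e->result : e->operand;
}

int main(void) {
    entry_t e = { .result = 42, .opcode = 0x01, .fanout_info = 1, .operand = 0 };
    printf("operand seen by execution unit: %llu\n",
           (unsigned long long)operand_for_exec(&e));
    return 0;
}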
Abstract:
The various aspects provide a dynamic compilation framework that includes a machine-independent optimization module operating on a computing device and methods for optimizing code with the machine-independent optimization module using a single, combined-forwards-backwards pass of the code. In the various aspects, the machine-independent optimization module may generate a graph of nodes from the intermediate representation (IR), optimize nodes in the graph using forwards and backwards optimizations, and propagate the forwards and backwards optimizations to nodes in a bounded subgraph recognized or defined based on the position of the node currently being optimized. In the various aspects, the machine-independent optimization module may optimize the graph by performing forwards and/or backwards optimizations during a single pass through the graph, thereby achieving an effective degree of optimization and shorter overall compile times. Thus, the various aspects may provide a global optimization framework for dynamic compilers that is faster and more efficient than existing solutions.
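A toy C sketch of such a combined pass, assuming a linear graph in which each node is used only by its immediate successors and the bounded subgraph is a fixed window around the current node (both are simplifications of the described framework):

#include <stdbool.h>
#include <stdio.h>

#define N      5
#define WINDOW 2  /* bounded subgraph: nodes within WINDOW of current */

/* Toy IR node: a constant, or an add of the two preceding nodes. */
typedef struct {
    bool is_const;
    int  value;
    bool live;
} node_t;

/* Single combined pass: fold constants forwards, then propagate the
   effect backwards, but only within the bounded window around node i. */
static void optimize(node_t *g, int n) {
    for (int i = 2; i < n; i++) {
        if (!g[i].is_const && g[i - 1].is_const && g[i - 2].is_const) {
            g[i].value = g[i - 1].value + g[i - 2].value;  /* forwards */
            g[i].is_const = true;
            int lo = (i - WINDOW < 0) ? 0 : i - WINDOW;
            for (int j = i - 1; j >= lo; j--)  /* backwards, bounded */
                g[j].live = false;             /* sole use was folded away */
        }
    }
}

int main(void) {
    node_t g[N] = { {true, 1, true}, {true, 2, true}, {false, 0, true},
                    {true, 4, true}, {false, 0, true} };
    optimize(g, N);
    for (int i = 0; i < N; i++)
        printf("node %d: value=%d live=%d\n", i, g[i].value, g[i].live);
    return 0;
}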
Abstract:
Systems and methods are directed to efficient management of processor resources, particularly General Purpose Registers (GPRs), for example to minimize pipeline flushes and prevent deadlocks by counting GPRs instead of allocating them to specific blocks of code. Blocks of code are allowed to execute if the Free GPRs count is adequate. The method contemplates counting the number of Register Writers (instructions that will write to GPRs) in blocks of code that are in the process of executing, and counting the GPRs which are available, instead of merely allocating them to dedicated use by a block of code or an instruction in a block of code. Because blocks do not run if there are not enough GPRs available for the block, deadlocks and pipeline flushes due to a lack of resources can be minimized.
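A minimal C sketch of the counting discipline, with the pool size and function names assumed: a block dispatches only if the Free GPRs count covers its Register Writers, and otherwise stalls up front rather than deadlocking mid-execution.

#include <stdbool.h>
#include <stdio.h>

static int free_gprs = 64;  /* assumed pool size */

/* A block may dispatch only if the free-GPR count covers the number of
   register writers it contains; registers are counted, not pinned to
   the block. */
static bool try_dispatch(int register_writers) {
    if (free_gprs < register_writers)
        return false;  /* stall now instead of deadlocking later */
    free_gprs -= register_writers;
    return true;
}

/* When a writer's result is no longer needed, its register returns to
   the pool. */
static void release_gprs(int count) {
    free_gprs += count;
}

int main(void) {
    printf("dispatch 60 writers: %d\n", try_dispatch(60));  /* 1 */
    printf("dispatch 10 writers: %d\n", try_dispatch(10));  /* 0: stalls */
    release_gprs(60);
    printf("dispatch 10 writers: %d\n", try_dispatch(10));  /* 1 */
    return 0;
}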
Abstract:
The disclosure relates to processing in-flight blocks in a processor pipeline according to an expected execution mode to reduce synchronization delays that could otherwise arise due to transitions among processor modes with varying privilege levels (e.g., user mode, supervisor mode, hypervisor mode, etc.). More particularly, a program counter associated with an instruction block to be fetched may be translated to one or more execute permissions associated with the instruction block, and the instruction block may be associated with a speculative execution mode based at least in part on the one or more execute permissions. Accordingly, the instruction block may be processed according to the speculative execution mode while in-flight within the processor pipeline.
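A minimal C sketch of the translation-to-mode step, with the permission encoding and the PC-to-permission mapping invented for illustration (a real design would consult the page tables or an equivalent structure):

#include <stdint.h>
#include <stdio.h>

/* Illustrative privilege levels and execute permissions. */
typedef enum { MODE_USER, MODE_SUPERVISOR, MODE_HYPERVISOR } exec_mode_t;

#define PERM_EXEC_USER  (1u << 0)
#define PERM_EXEC_SUPER (1u << 1)
#define PERM_EXEC_HYP   (1u << 2)

/* Hypothetical translation of a program counter to execute permissions. */
static uint32_t translate_permissions(uint64_t pc) {
    return (pc < 0x8000) ? PERM_EXEC_USER : PERM_EXEC_SUPER;
}

/* Pick the speculative mode for the in-flight block from its execute
   permissions, so the pipeline need not drain on a mode transition. */
static exec_mode_t speculative_mode(uint64_t pc) {
    uint32_t perms = translate_permissions(pc);
    if (perms & PERM_EXEC_HYP)   return MODE_HYPERVISOR;
    if (perms & PERM_EXEC_SUPER) return MODE_SUPERVISOR;
    return MODE_USER;
}

int main(void) {
    printf("mode for pc 0x1000: %d\n", speculative_mode(0x1000));
    printf("mode for pc 0x9000: %d\n", speculative_mode(0x9000));
    return 0;
}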
Abstract:
Method and apparatus for cache way prediction using a plurality of partial tags are provided. In a cache comprising a plurality of sets and a plurality of ways or lines, one of the sets is selected by the index of a cache-block address, and a plurality of distinct partial tags are identified for the selected set. A determination is made as to whether a partial tag for a new line collides with any of the partial tags for current resident lines in the selected set. If the partial tag for the new line does not collide with any of the partial tags for the current resident lines, then there is no aliasing. If the partial tag for the new line collides with any of the partial tags for the current resident lines, then aliasing may be avoided by reading the full tag array and updating the partial tags.
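A minimal C sketch of the collision test, assuming an 8-bit partial tag and a 4-way set (both widths are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS 4

/* Partial tag: low-order tag bits only (width assumed). */
static uint16_t partial_tag(uint64_t full_tag) {
    return (uint16_t)(full_tag & 0xFF);
}

/* Returns true if the new line's partial tag matches any resident
   line's partial tag in the selected set, i.e., aliasing is possible
   and the full tag array must be read. */
static bool partial_tag_collides(const uint64_t resident_tags[WAYS],
                                 uint64_t new_tag) {
    for (int w = 0; w < WAYS; w++)
        if (partial_tag(resident_tags[w]) == partial_tag(new_tag))
            return true;
    return false;
}

int main(void) {
    uint64_t resident[WAYS] = { 0x1A2, 0x3B4, 0x5C6, 0x7D8 };
    printf("collision: %d\n", partial_tag_collides(resident, 0x9A2));  /* 1 */
    printf("collision: %d\n", partial_tag_collides(resident, 0x911));  /* 0 */
    return 0;
}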
Abstract:
An apparatus for mapping an architectural register to a physical register can include a memory and control circuitry. The memory can be configured to store an intra-core register rename map and an inter-core register rename map. The intra-core register rename map can be configured to map the architectural register to the physical register of a first core of a multi-core processor. The inter-core register rename map can be configured to relate the architectural register to an identification of the first core in response to determining that the physical register is the location of the most recent write, according to program order, to the architectural register that has been executed by the first core, is executing on the first core, or is expected to execute on the first core. The control circuitry can be configured to maintain the intra-core register rename map and the inter-core register rename map.
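A minimal C sketch of the two maps and a read lookup, with the structure and names assumed: the inter-core map names the core owning the most recent write, and the intra-core map supplies the physical register when that owner is the local core.

#include <stdint.h>
#include <stdio.h>

#define ARCH_REGS 32

/* Per-core intra-core map: architectural -> physical register. */
typedef struct {
    uint8_t phys[ARCH_REGS];
} intra_map_t;

/* Shared inter-core map: architectural register -> id of the core
   holding (or producing) the most recent write in program order. */
typedef struct {
    uint8_t owner_core[ARCH_REGS];
} inter_map_t;

/* Resolve a read of architectural register `arch` on core `me`: if
   another core owns the latest write, the value must come from it. */
static void resolve(const intra_map_t *intra, const inter_map_t *inter,
                    int me, int arch) {
    int owner = inter->owner_core[arch];
    if (owner == me)
        printf("r%d -> local p%d\n", arch, intra[me].phys[arch]);
    else
        printf("r%d -> fetch from core %d\n", arch, owner);
}

int main(void) {
    intra_map_t intra[2] = { 0 };
    inter_map_t inter = { 0 };
    intra[0].phys[5] = 17;    /* core 0 renamed r5 to p17 */
    inter.owner_core[5] = 0;  /* core 0 owns the latest write to r5 */
    resolve(intra, &inter, 0, 5);
    resolve(intra, &inter, 1, 5);
    return 0;
}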