Abstract:
In one embodiment, a method is contemplated. Access to a hardware accelerator is requested by a user-privileged thread. Access to the hardware accelerator is granted to the user-privileged thread by a higher-privileged thread responsive to the requesting. One or more commands are communicated to the hardware accelerator by the user-privileged thread, without intervention by higher-privileged threads, responsive to the grant of access. The one or more commands cause the hardware accelerator to perform one or more tasks. Computer readable media comprising instructions which, when executed, implement portions of the method are also contemplated in various embodiments, as are a hardware accelerator and a processor coupled to the hardware accelerator.
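The request/grant/submit flow above can be illustrated with a short C++ sketch. The Hypervisor and CommandQueue names below are hypothetical stand-ins for the higher-privileged thread and the accelerator's memory-mapped command interface; they are not taken from the embodiment. The point of the model is that only the initial access request crosses the privilege boundary; subsequent command submission does not.

    // Minimal sketch of the grant-then-direct-submit flow; all names
    // (Hypervisor, CommandQueue, etc.) are illustrative assumptions.
    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <queue>

    struct Command { uint32_t opcode; uint64_t operand; };

    // Stands in for the accelerator's command interface.
    struct CommandQueue { std::queue<Command> ring; };

    // Higher-privileged thread: validates the request and hands back a
    // direct reference to the queue (in hardware, e.g., by mapping a page).
    class Hypervisor {
    public:
        std::optional<CommandQueue*> request_access(int thread_id) {
            if (!allowed(thread_id)) return std::nullopt;
            return &queue_;  // grant: user thread may now write directly
        }
    private:
        bool allowed(int) { return true; }  // placeholder policy check
        CommandQueue queue_;
    };

    int main() {
        Hypervisor hv;
        auto q = hv.request_access(/*thread_id=*/7);  // one privileged round trip
        if (!q) return 1;
        // After the grant, commands bypass the higher-privileged thread.
        (*q)->ring.push({/*opcode=*/1, /*operand=*/0xDEADBEEF});
        std::cout << "queued " << (*q)->ring.size() << " command(s)\n";
    }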
Abstract:
In one embodiment, a processor comprises execution circuitry and a translation lookaside buffer (TLB) coupled to the execution circuitry. The execution circuitry is configured to execute a store instruction having a data operand; and the execution circuitry is configured to generate a virtual address as part of executing the store instruction. The TLB is coupled to receive the virtual address and configured to translate the virtual address to a first physical address. Additionally, the TLB is coupled to receive the data operand and to translate the data operand to a second physical address. A hardware accelerator is also contemplated in various embodiments, as is a processor coupled to the hardware accelerator, a method, and a computer readable medium storing instructions which, when executed, implement a portion of the method.
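A toy model of the double translation described above, in C++. The page size, TLB structure, and addresses are illustrative assumptions; the key behavior is that the TLB is consulted once for the store's generated virtual address and once for the data operand itself.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Toy page-granular TLB; 4 KB pages and the map layout are illustrative.
    class TLB {
    public:
        void insert(uint64_t vpn, uint64_t ppn) { map_[vpn] = ppn; }
        uint64_t translate(uint64_t va) const {
            constexpr uint64_t kPageBits = 12;
            return (map_.at(va >> kPageBits) << kPageBits) | (va & 0xFFF);
        }
    private:
        std::unordered_map<uint64_t, uint64_t> map_;
    };

    int main() {
        TLB tlb;
        tlb.insert(0x1000 >> 12, 0x7);  // store target page
        tlb.insert(0x2000 >> 12, 0x9);  // page named by the data operand
        uint64_t store_va = 0x1008;     // address generated by the store
        uint64_t data_va  = 0x2010;     // data operand is itself a virtual address
        // The TLB is consulted twice: once per translation described above.
        std::cout << std::hex
                  << "first physical address:  " << tlb.translate(store_va) << "\n"
                  << "second physical address: " << tlb.translate(data_va)  << "\n";
    }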
Abstract:
Maintaining a cache of indications of exclusively-owned coherence state for memory space units (e.g., cache lines) allows reduction, if not elimination, of the delay from missing store operations. In addition, the indications are maintained without the corresponding data of the memory space unit, thus allowing representation of a large memory space with a relatively small missing store operation accelerator. With the missing store operation accelerator, a store operation that misses in low-latency memory (e.g., an L1 or L2 cache) proceeds as if the targeted memory space unit resided in the low-latency memory, if so indicated in the missing store operation accelerator. When a store operation misses in low-latency memory and hits in the accelerator, a positive acknowledgement is transmitted to the writing processing unit, allowing the store operation to proceed. An entry is allocated for the store operation, the store data is written into the allocated entry, and the target of the store operation is requested from memory. When a copy of the data at the requested memory space unit returns, the rest of the allocated entry is updated.
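A minimal C++ sketch of the accelerator's bookkeeping, under the assumption that a simple set of line addresses stands in for the hardware structure. The accelerator records ownership indications only; the store data lives in a separate merge-buffer entry until the line returns from memory.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    #include <unordered_set>

    constexpr uint64_t kLineBits = 6;  // 64-byte lines, illustrative

    struct PendingStore { uint64_t data; bool line_filled; };

    // Tracks exclusively-owned lines by address only -- no data is kept,
    // so a small table can cover a large memory space.
    class MissingStoreAccelerator {
    public:
        void mark_exclusive(uint64_t line) { exclusive_.insert(line); }
        bool hit(uint64_t line) const { return exclusive_.count(line) != 0; }
    private:
        std::unordered_set<uint64_t> exclusive_;
    };

    int main() {
        MissingStoreAccelerator msa;
        std::unordered_map<uint64_t, PendingStore> merge_buffer;

        uint64_t addr = 0x1FC0;
        uint64_t line = addr >> kLineBits;
        msa.mark_exclusive(line);  // learned from earlier coherence traffic

        // Store misses L1/L2 but hits the accelerator: ack immediately,
        // buffer the data, and fetch the rest of the line in the background.
        if (msa.hit(line)) {
            std::cout << "ack sent; store proceeds without waiting\n";
            merge_buffer[line] = {/*data=*/42, /*line_filled=*/false};
            // ... memory request for the target line issued here ...
        }
        // Later, when the line returns, update the rest of the entry.
        merge_buffer[line].line_filled = true;
        std::cout << "line filled; remainder of entry updated\n";
    }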
Abstract:
RAW aliasing can be predicted with register bypassing based at least in part on execution displacement alias prediction. Repeated aliasing between read and write operations (e.g., within a loop) can be reliably predicted based on the displacement between the aliasing operations. Performing register bypassing for operations predicted to alias facilitates faster RAW bypassing and mitigates the performance impact of aliasing read operations. The repeated aliasing between operations is tracked along with register information for the aliasing write operations. After a confidence threshold is exceeded, an instance of a read operation is predicted to alias with an instance of a write operation in accordance with the previously observed repeated aliasing. Based on the displacement between the instances of the operations, the register information of the write operation instance is used to bypass data to the read operation instance.
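The training-then-prediction behavior can be sketched in C++ as follows. The table layout, threshold value, and register numbers are assumptions for illustration; the mechanism is the one described above: repeated aliasing at a stable displacement builds confidence, after which the write's register value is bypassed to the read.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // One predictor entry per read operation; fields are illustrative.
    struct AliasEntry {
        int64_t displacement;  // observed distance between the aliasing ops
        int     store_reg;     // register holding the write's data
        int     confidence = 0;
    };

    constexpr int kThreshold = 3;  // assumed confidence threshold

    int main() {
        std::unordered_map<uint64_t, AliasEntry> table;  // keyed by read PC
        uint64_t load_pc = 0x400;

        // Training: the same read repeatedly aliases the same write
        // (e.g., across loop iterations) at a stable displacement.
        for (int i = 0; i < 4; ++i) {
            auto& e = table[load_pc];
            e.displacement = 8;  // instances observed 8 operations apart, say
            e.store_reg = 5;
            ++e.confidence;
        }

        // Prediction: once confident, bypass the write's register value
        // to the read instead of waiting for memory RAW resolution.
        const auto& e = table[load_pc];
        if (e.confidence >= kThreshold)
            std::cout << "bypass r" << e.store_reg
                      << " to read at displacement " << e.displacement << "\n";
    }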
Abstract:
One embodiment of the present invention provides a system that improves the effectiveness of prefetching during execution of instructions in scout mode. During operation, the system executes program instructions in a normal-execution mode. Upon encountering a condition that causes the processor to enter scout mode, the system performs a checkpoint and commences execution of instructions in scout mode, wherein the instructions are speculatively executed to prefetch future memory operations, but results are not committed to the architectural state of the processor. When a load instruction executes in scout mode, if the load instruction is a special load instruction and causes a lower-level cache miss, the system waits for data to be returned from a higher-level cache before resuming execution of subsequent instructions in scout mode, instead of disregarding the result of the load instruction and immediately resuming execution. In this way, the data returned from the higher-level cache can help in generating addresses for subsequent prefetches during scout mode.
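A compact C++ sketch of the special-load policy, with the cache lookups reduced to stub functions (an assumption made purely for illustration). An ordinary load that misses in scout mode is disregarded; a special load waits for the higher-level cache so its value can feed later prefetch-address generation.

    #include <iostream>
    #include <optional>

    // Toy memory hierarchy: lookups that miss return nullopt.
    std::optional<int> l1_lookup(long addr) { (void)addr; return std::nullopt; }
    std::optional<int> l2_lookup(long addr) { (void)addr; return 123; }

    // In scout mode, an ordinary load that misses is dropped and execution
    // continues; a "special" load instead waits for the higher-level cache
    // so its result can feed subsequent prefetch addresses.
    int scout_load(long addr, bool special) {
        if (auto v = l1_lookup(addr)) return *v;
        if (!special) return 0;              // ordinary load: result disregarded
        return l2_lookup(addr).value_or(0);  // special load: wait for L2 data
    }

    int main() {
        int base = scout_load(0x1000, /*special=*/true);
        // The returned value can now generate the next prefetch address,
        // e.g., in a pointer-chasing pattern.
        std::cout << "prefetching from " << std::hex << base + 0x40 << "\n";
    }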
Abstract:
In accordance with one embodiment, an enhanced chip multiprocessor permits an L1 cache to request ownership of a data line from a shared L2 cache. A determination is made whether to deny or grant the request for ownership based on the sharing of the data line. In one embodiment, the sharing of the data line is determined from an enhanced L2 cache directory entry associated with the data line. If ownership of the data line is granted, the current data line is passed from the shared L2 cache to the requesting L1 cache, and the associated enhanced L1 cache directory entry and the enhanced L2 cache directory entry are updated to reflect the L1 cache's ownership of the data line. Consequently, updates of the data line by the L1 cache do not go through the shared L2 cache, reducing transaction pressure on the shared L2 cache.
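The grant/deny decision can be modeled briefly in C++. The directory-entry layout and sole-sharer policy below are illustrative assumptions, not the patent's exact criteria; the sketch shows ownership being decided from the enhanced L2 directory entry and the entry being updated to reflect the grant.

    #include <bitset>
    #include <iostream>
    #include <unordered_map>

    constexpr int kNumL1 = 4;  // four cores/L1 caches, illustrative

    // Enhanced L2 directory entry: which L1 caches share the line, and
    // whether one of them currently owns it.
    struct DirEntry {
        std::bitset<kNumL1> sharers;
        int owner = -1;
    };

    // Grant ownership only if the requester is the sole sharer, so that
    // later updates by that L1 need not go through the shared L2.
    bool request_ownership(DirEntry& e, int requester) {
        if (e.sharers.count() == 1 && e.sharers.test(requester)) {
            e.owner = requester;  // update directory to reflect L1 ownership
            return true;
        }
        return false;             // widely shared: deny, keep line in L2
    }

    int main() {
        std::unordered_map<uint64_t, DirEntry> l2_directory;
        auto& e = l2_directory[0x80];
        e.sharers.set(2);  // only L1 cache #2 holds the line

        std::cout << (request_ownership(e, 2) ? "granted" : "denied")
                  << ", owner = " << e.owner << "\n";
    }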
Abstract:
The storage of a data line in one or more L1 caches and/or a shared L2 cache of a chip multiprocessor is dynamically optimized based on the sharing of the data line. In one embodiment, an enhanced L2 cache directory entry associated with the data line is generated in an L2 cache directory of the shared L2 cache. The enhanced L2 cache directory entry includes a cache mask indicating the storage state of the data line in the one or more L1 caches and the shared L2 cache. In some embodiments, where the data line is stored in the shared L2 cache only, a portion of the cache mask indicates the storage history of the data line in the one or more L1 caches.
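A short C++ sketch of one plausible encoding of the cache mask; the field layout is an assumption for illustration. While the line is held by L1 caches the mask records current sharers, and when the line resides in the shared L2 only, the same bits are reinterpreted as storage history.

    #include <bitset>
    #include <iostream>

    constexpr int kNumL1 = 4;  // number of L1 caches; illustrative

    // Hypothetical enhanced L2 directory entry. When l2_only is clear,
    // `mask` holds the current L1 sharers of the line; when l2_only is
    // set, the same bits record which L1 caches held the line recently.
    struct EnhancedEntry {
        std::bitset<kNumL1> mask;
        bool in_l2   = false;
        bool l2_only = false;
    };

    int main() {
        EnhancedEntry e;
        e.in_l2 = true;
        e.mask.set(0);
        e.mask.set(3);  // line currently in L1 #0, L1 #3, and the shared L2
        std::cout << "sharers: " << e.mask << "\n";

        // All L1 copies are evicted; keep the old sharer bits as history
        // so a later placement decision can see the line was shared.
        e.l2_only = true;
        std::cout << "history: " << e.mask << " (line now in L2 only)\n";
    }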
Abstract:
One embodiment of the present invention provides a system that improves the effectiveness of prefetching during execution of instructions in scout mode. Upon encountering a non-data-dependent stall condition, the system performs a checkpoint and commences execution of instructions in scout mode, wherein instructions are speculatively executed to prefetch future memory operations, but results are not committed to the architectural state of the processor. When the system executes a load instruction during scout mode, if the load instruction causes a lower-level cache miss, the system allows the load instruction to access a higher-level cache. Next, the system places the load instruction and subsequent dependent instructions into a deferred queue and resumes execution of the program in scout mode. If the load instruction ultimately causes a hit in the higher-level cache, the system replays the load instruction and the subsequent dependent instructions in the deferred queue, whereby the value retrieved from the higher-level cache can help in generating prefetches during scout mode.
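The deferred-queue mechanism can be sketched in C++ with stubbed cache lookups (an illustrative assumption). A missing load and its dependents are parked in the queue while scout execution continues, then replayed when the higher-level cache access returns a hit.

    #include <deque>
    #include <iostream>
    #include <optional>

    struct DeferredOp { long addr; const char* what; };

    // Stubbed hierarchy: L1 always misses here, L2 later hits.
    std::optional<int> l1_lookup(long) { return std::nullopt; }
    std::optional<int> l2_lookup(long) { return 321; }

    int main() {
        std::deque<DeferredOp> deferred;

        // Scout mode: the load misses the lower-level cache, so it and its
        // dependents are parked in the deferred queue while speculative
        // execution of the program continues.
        long addr = 0x2000;
        if (!l1_lookup(addr)) {
            deferred.push_back({addr, "load"});
            deferred.push_back({addr, "dependent add"});
            std::cout << "deferred " << deferred.size() << " ops, scouting on\n";
        }

        // The higher-level cache access completes with a hit: replay the
        // deferred chain so the returned value can feed further prefetches.
        if (auto v = l2_lookup(addr)) {
            for (const auto& op : deferred)
                std::cout << "replay " << op.what << " with value " << *v << "\n";
            deferred.clear();
        }
    }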