Abstract:
A virtually tagged cache may be configured to index the virtual address entries in the cache into lockable sets based on a page offset value. When a memory operation misses in the virtually tagged cache, only the set of virtual address entries sharing that page offset may be locked. Thereafter, this general lock may be released, and only the entry in the physical tag array matching the physical address, together with the corresponding entry in the virtual tag array, may remain locked, reducing the number of locked addresses and the duration of the locks. The machine may be stalled only if a memory address request hits or otherwise tries to access one or more entries in a locked set. Devices, systems, methods, and computer readable media are provided.
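A minimal C sketch of the set-locking idea, assuming 4 KiB pages, 64-byte lines, and 64 lockable sets; all names and sizes here are illustrative assumptions rather than details from the abstract, and the later fine-grained per-entry lock is elided:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64   /* assumed number of lockable sets */

/* One lock bit per set of virtual address entries sharing a page offset. */
static bool set_locked[NUM_SETS];

/* Index a virtual address into a lockable set using its page-offset bits
 * (assumed: 64-byte lines, so bits 6 and up within the page pick the set). */
static unsigned set_index(uint64_t vaddr)
{
    return (unsigned)((vaddr >> 6) & (NUM_SETS - 1));
}

/* On a miss, lock only the one set with the matching page offset. */
static void lock_set_on_miss(uint64_t vaddr)
{
    set_locked[set_index(vaddr)] = true;
}

/* Release the general set lock once the single matching physical-tag entry
 * and its corresponding virtual-tag entry have been locked individually. */
static void release_set_lock(uint64_t vaddr)
{
    set_locked[set_index(vaddr)] = false;
}

/* The machine stalls only when a request touches a locked set. */
static bool must_stall(uint64_t vaddr)
{
    return set_locked[set_index(vaddr)];
}
```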
Abstract:
A system and method for adaptive data prefetching in a processor enables adaptive modification of parameters associated with a prefetch operation. A stride pattern in successive addresses of a memory operation may be detected, including determining a stride length (L). Prefetching of memory operations may be based on a prefetch address determined from a base memory address, the stride length L, and a prefetch distance (D). A number of prefetch misses may be tracked as a miss prefetch count (C). Based on the value of the miss prefetch count C, the prefetch distance D may be modified. As a result of adaptive modification of the prefetch distance D, an improved rate of cache hits may be realized.
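A short C sketch of one way the L, D, and C parameters could interact. The prefetch-address formula (base + L * D) follows the abstract; the doubling policy, thresholds, and names are assumptions, since the abstract says only that D is modified based on C:

```c
#include <stdint.h>

/* Illustrative thresholds; the abstract gives no concrete values. */
#define MISS_THRESHOLD 4U
#define MAX_DISTANCE   16U

typedef struct {
    int64_t  stride_len;  /* L: detected stride between successive addresses */
    uint32_t distance;    /* D: how many strides ahead to prefetch */
    uint32_t miss_count;  /* C: prefetch misses observed so far */
} prefetcher_t;

/* Prefetch address = base memory address + L * D. */
static uint64_t prefetch_addr(const prefetcher_t *p, uint64_t base)
{
    return base + (uint64_t)(p->stride_len * (int64_t)p->distance);
}

/* On each prefetch miss, bump C; once C crosses a threshold, widen D so
 * later prefetches are issued further ahead of the demand stream. */
static void on_prefetch_miss(prefetcher_t *p)
{
    if (++p->miss_count >= MISS_THRESHOLD && p->distance < MAX_DISTANCE) {
        p->distance *= 2;   /* one possible adaptation policy */
        p->miss_count = 0;  /* restart the count after adapting */
    }
}
```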
Abstract:
Methods related to communication between and within nodes in a high performance computing system are presented. Processing time for message exchange between a processing unit and a network interface controller in a node is reduced. Resources required to manage application state in the network interface controller are minimized. In the network interface controller, multiple contexts are multiplexed into one physical Direct Memory Access engine. Virtual to physical address translation in the network interface controller is accelerated by using a plurality of independent caches, with each level of the page table hierarchy cached in an independent cache. A memory management scheme for data structures distributed between the processing unit and the network interface controller is provided. The state required to implement end-to-end reliability is reduced by limiting the transmit sequence number space to the currently in-flight messages.
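Of the mechanisms listed, the per-level translation caching lends itself to a brief sketch. Below is a hedged C illustration assuming an x86-64-style four-level page table with 9 index bits per level and 4 KiB pages; the structure names, cache geometry, and lookup policy are all assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_LEVELS  4    /* assumed: four-level page table */
#define CACHE_WAYS 16   /* assumed: entries per per-level cache */

typedef struct {
    bool     valid;
    uint64_t va_prefix;   /* virtual-address bits covered at this level */
    uint64_t next_table;  /* physical address of the next-level table */
} pt_cache_entry_t;

/* One independent cache for each level of the page-table hierarchy. */
static pt_cache_entry_t pt_cache[PT_LEVELS][CACHE_WAYS];

/* Virtual-address prefix identifying an entry at a level: deeper levels
 * keep more bits (assumed 9 index bits per level, 4 KiB pages). */
static uint64_t prefix(uint64_t vaddr, int level)
{
    return vaddr >> (12 + 9 * (PT_LEVELS - 1 - level));
}

/* Start the walk at the deepest level with a cached translation, skipping
 * the upper levels entirely when their entries are already cached. */
static int deepest_cached_level(uint64_t vaddr, uint64_t *next_table)
{
    for (int level = PT_LEVELS - 1; level >= 0; level--)
        for (int way = 0; way < CACHE_WAYS; way++)
            if (pt_cache[level][way].valid &&
                pt_cache[level][way].va_prefix == prefix(vaddr, level)) {
                *next_table = pt_cache[level][way].next_table;
                return level;
            }
    return -1;   /* nothing cached: walk from the root */
}
```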
Abstract:
A bit or other vector may be used to identify whether an address range entered into an intermediate buffer corresponds to the most recently updated data associated with that address range. A bit or other vector may also be used to identify whether an address range entered into an intermediate buffer overlaps with the address range of data that is to be loaded. Depending on the particular vector settings, a processing device may then determine whether to obtain the data to be loaded entirely from a cache, entirely from an intermediate buffer that temporarily holds data destined for the cache until the cache is ready to accept it, or from both the cache and the intermediate buffer. Systems, devices, methods, and computer readable media are provided.
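A minimal C sketch of how such vectors could drive the source decision, assuming 64-byte lines and one vector bit per byte; the names and the exact decision rule are illustrative assumptions:

```c
#include <stdint.h>

/* assumed: 64-byte lines, one bit per byte in each vector */
typedef struct {
    uint64_t line_addr;
    uint64_t newest;    /* bit i set: buffer holds the newest copy of byte i */
    uint8_t  data[64];
} ibuf_entry_t;

typedef enum { FROM_CACHE, FROM_BUFFER, FROM_BOTH } load_source_t;

/* Build a bit vector marking the bytes a load touches within its line. */
static uint64_t load_mask(unsigned offset, unsigned size)
{
    uint64_t m = (size >= 64) ? ~0ULL : ((1ULL << size) - 1);
    return m << offset;
}

/* Intersect the load's byte vector with the buffer's "newest data" vector
 * to decide where the load's bytes must come from. */
static load_source_t pick_source(const ibuf_entry_t *e,
                                 unsigned offset, unsigned size)
{
    uint64_t want    = load_mask(offset, size);
    uint64_t overlap = want & e->newest;

    if (overlap == 0)    return FROM_CACHE;   /* no overlap: cache only  */
    if (overlap == want) return FROM_BUFFER;  /* fully covered: buffer   */
    return FROM_BOTH;                         /* partial: merge the two  */
}
```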
Abstract:
A system and method to implement a tag structure for a cache memory that includes a multi-way, set-associative translation lookaside buffer. The tag structure may store vectors in an L1 tag array to enable an L1 tag lookup that has fewer bits per entry and consumes less power. The vectors may identify entries in a translation lookaside buffer tag array. When a virtual memory address associated with a memory access instruction hits in the translation lookaside buffer, the translation lookaside buffer may generate a vector identifying the set and the way of the translation lookaside buffer entry that matched. This vector may then be compared to a group of vectors stored in a set of the L1 tag arrays to determine whether the virtual memory address hits in the L1 cache.
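A compact C sketch of the vector-matching idea, assuming an illustrative 4-way TLB and an 8-way L1 cache; the encoding of the (set, way) vector and the geometry are assumptions, and the L1 set-indexing step is elided:

```c
#include <stdbool.h>
#include <stdint.h>

/* assumed geometry */
#define TLB_WAYS 4
#define L1_WAYS  8

/* Encode a TLB hit as a compact (set, way) identifier instead of a full
 * virtual tag; this is what each L1 tag entry stores, so an L1 tag lookup
 * needs fewer bits per entry. */
static uint8_t tlb_vector(unsigned set, unsigned way)
{
    return (uint8_t)(set * TLB_WAYS + way);
}

/* L1 tag array: each way of a set holds the vector of the TLB entry that
 * translated the line's address, plus a valid bit. */
typedef struct { bool valid; uint8_t vec; } l1_tag_t;

/* Compare the vector from the TLB hit against every way in the indexed
 * L1 set; a match means the virtual address hits in the L1 cache. */
static int l1_lookup(const l1_tag_t set_tags[L1_WAYS], uint8_t hit_vec)
{
    for (int way = 0; way < L1_WAYS; way++)
        if (set_tags[way].valid && set_tags[way].vec == hit_vec)
            return way;   /* L1 hit in this way */
    return -1;            /* L1 miss */
}
```

Storing a small TLB-entry identifier rather than a full virtual tag is what allows the narrower, lower-power L1 tag lookup the abstract describes.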