Abstract:
A multi-core processor includes a plurality of cores to execute a plurality of threads and to monitor metrics for each of the plurality of threads during an interval, the metrics including stall cycle values, prefetches of a first type, and prefetches of a second type. The multi-core processor further includes criticality-aware thread prioritization (CATP) logic to compute a stall fraction for each of the plurality of threads during the interval using the stall cycle values, identify a thread with a highest stall fraction of the plurality of threads, determine the highest stall fraction is greater than a stall threshold, prioritize demand requests of the identified thread, compute a prefetch accuracy of the identified thread during the interval using the prefetches of the first type and the prefetches of the second type, determine the prefetch accuracy is greater than a prefetch threshold, and prioritize prefetch requests of the identified thread.
Abstract:
Some implementations disclosed herein provide techniques for caching memory data and for managing cache retention. Different cache retention policies may be applied to different cached data streams such as those of a graphics processing unit. Actual performance of the cache with respect to the data streams may be observed, and the cache retention policies may be varied based on the observed actual performance.
Abstract:
Techniques are described for optimizing memory management in a processor system. The techniques may be implemented on processors that include on-chip performance monitoring and on systems where an external performance monitor is coupled to a processor. Processors that include a Performance Monitoring Unit (PMU) are examples. The PMU may store data on read and write cache misses, as well as data on translation lookaside buffer (TLB) misses. The data from the PMU is used to determine if any memory regions within a memory heap are delinquent memory regions, i.e., regions exhibiting high numbers of memory problems or stalls. If delinquent memory regions are found, the memory manager, such as a garbage collection routine, can efficiently optimize memory performance as well as the mutators performance by improving the layout of objects in the heap. In this way, memory management routines may be focused based on dynamic and real-time memory performance data.
Abstract:
Techniques are described for optimizing memory management in a processor system. The techniques may be implemented on processors that include on-chip performance monitoring and on systems where an external performance monitor is coupled to a processor. Processors that include a Performance Monitoring Unit (PMU) are examples. The PMU may store data on read and write cache misses, as well as data on translation lookaside buffer (TLB) misses. The data from the PMU is used to determine if any memory regions within a memory heap are delinquent memory regions, i.e., regions exhibiting high numbers of memory problems or stalls. If delinquent memory regions are found, the memory manager, such as a garbage collection routine, can efficiently optimize memory performance as well as the mutators performance by improving the layout of objects in the heap. In this way, memory management routines may be focused based on dynamic and real-time memory performance data.
Abstract:
Methods and apparatus to insert prefetch instructions based on garbage collector analysis and compiler analysis are disclosed. In an example method, one or more batches of samples associated with cache misses from a performance monitoring unit in a processor system are received. One or more samples from the one or more batches of samples based on delinquent information are selected. A performance impact indicator associated with the one or more samples is generated. Based on the performance indicator, at least one of a garbage collector analysis and a compiler analysis is initiated to identify one or more delinquent paths. Based on the at least one of the garbage collector analysis and the compiler analysis, one or more prefetch points to insert prefetch instructions are identified.
Abstract:
Apparatuses and articles of manufacture are disclosed. An example apparatus includes an activation function control and decode circuitry to populate an input buffer circuitry with an input data element bit subset of less than a threshold number of bits of the input data element retrieved from the memory circuitry. The activation function and control circuitry also populate a kernel weight buffer circuitry with a weight data element bit subset of less than the threshold number of bits of the weight data element retrieved from the memory circuitry. The apparatus also including a preprocessor circuitry to calculate a partial convolution value of at least a portion of the input data element bit subset and the weight data element bit subset to determine a predicted sign of the partial convolution value.
Abstract:
An apparatus and method are described for implementing an exclusive lower level cache (LLC) policy within a computer processor. For example, one embodiment of a computer processor comprises: a mid-level cache circuit (MLC) for storing a first set of cache lines containing instructions and/or data; a lower level cache circuit (LLC) for storing a second set of cache lines of instructions and/or data; and an insertion circuit for implementing a policy for inserting or replacing cache lines within the LLC based on values of use recency and use frequency associated with the lines.
Abstract:
An apparatus and method are described for implementing an exclusive lower level cache (LLC) policy within a computer processor. For example, one embodiment of a computer processor comprises: a mid-level cache circuit (MLC) for storing a first set of cache lines containing instructions and/or data; a lower level cache circuit (LLC) for storing a second set of cache lines of instructions and/or data; and an insertion circuit for implementing a policy for inserting or replacing cache lines within the LLC based on values of use recency and use frequency associated with the lines.
Abstract:
Methods and apparatus to dynamically insert prefetch instructions are disclosed. In an example method, one or more samples associated with cache misses are identified from a performance monitoring unit in a processor system. Based on sample information associated with the one or more samples, delinquent information is generated. To dynamically insert one or more prefetch instructions, a prefetch point is identified based on the delinquent information.
Abstract:
An improved moving garbage collection algorithm is described. The algorithm allows efficient use of non-temporal stores to reduce the required time for garbage collection. Non-temporal stores (or copies) are a CPU feature that allows the copy of data objects within main memory with no interference or pollution of the cache memory. The live objects copied to new memory locations will not be accessed again in the near future and therefore need not be copied to cache. This avoids copy operations and avoids taxing the CPU with cache determinations. In a preferred embodiment, the algorithm of the present invention exploits the fact that live data objects will be stored to consecutive new memory locations in order to perform streaming copies. Since each copy procedure has an associated CPU overhead, the process of streaming the copies reduces the degradation of system performance and thus reduces the time for garbage collection.