Abstract:
A combined function specified by an instruction is performed. The combined function includes a plurality of operations performed as part of one invocation of the combined function. Performing the combined function includes performing a matrix multiplication of a first tensor and a second tensor to obtain one or more intermediate results. The second tensor includes an adjusted weight tensor created using a multiplier. Values of a bias tensor are added to the one or more intermediate results to obtain one or more results for the combined function. The one or more results are at least a part of an output tensor.
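The sequence of operations in this abstract can be illustrated with a minimal sketch. The abstract does not say how the multiplier produces the adjusted weight tensor, so element-wise scaling is assumed here; the function names `adjust_weights`, `matmul`, and `matmul_bias_add` are hypothetical and tensors are restricted to two dimensions.

```python
def adjust_weights(weights, multiplier):
    """Assumption: the adjusted weight tensor is the weights scaled element-wise."""
    return [[w * multiplier for w in row] for row in weights]


def matmul(a, b):
    """Plain 2-D matrix multiplication of nested lists."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def matmul_bias_add(inputs, adjusted_weights, bias):
    """One invocation: multiply, then add the bias value for each output column."""
    intermediate = matmul(inputs, adjusted_weights)
    return [[v + bias[j] for j, v in enumerate(row)] for row in intermediate]


if __name__ == "__main__":
    adjusted = adjust_weights([[1, 0], [0, 1]], 2)
    print(matmul_bias_add([[1, 2], [3, 4]], adjusted, [10, 20]))
```

In this sketch the weight adjustment happens once, ahead of the combined call, so the multiply and the bias add are the only work done per invocation.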
Abstract:
A method is provided for forming a Deep Neural Network (DNN). The method includes quantizing deep learning data structures of the DNN into at least two modes using at least two scale factors, respectively. Each of the at least two modes corresponds to a respective one of the at least two scale factors. The method further includes identifying which of the at least two scale factors to use for a given one of the data structures based on a data distribution of the given one of the data structures. The quantizing step includes identifying when a tail of the given one of the data structures starts by (i) building a histogram of values in the given one of the data structures using successive bins; (ii) identifying a ratio of density between the successive bins; and (iii) checking whether the ratio of density is greater than a ratio of density threshold.
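Steps (i) to (iii) of the tail detection can be sketched as follows. This is an illustration under stated assumptions, not the claimed method: equal-width bins, density taken as the raw bin count, and the ratio taken as the count of one bin divided by the count of the next are all assumptions, and `build_histogram` and `find_tail_start` are hypothetical names.

```python
def build_histogram(values, num_bins):
    """Step (i): count values into equal-width successive bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all values identical; avoid division by zero
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


def find_tail_start(counts, ratio_threshold):
    """Steps (ii) and (iii): return the index of the first bin of the tail.

    The tail is assumed to start at the bin whose predecessor is more than
    `ratio_threshold` times denser. Returns None if no such drop exists.
    """
    for i in range(len(counts) - 1):
        current, following = counts[i], counts[i + 1]
        if current == 0:
            continue  # no density to compare against
        if following == 0 or current / following > ratio_threshold:
            return i + 1
    return None


if __name__ == "__main__":
    print(find_tail_start([100, 90, 80, 5, 2], ratio_threshold=4.0))
```

A detected tail index could then inform which of the scale factors to apply to the data structure; that selection rule is not specified in the abstract and is left out here.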
Abstract:
A processor and a method implemented by the processor to obtain computation results are described. The processor includes a unified reuse table embedded in a processor pipeline, the unified reuse table including a plurality of entries, each entry of the plurality of entries corresponding with a computation instruction or a set of computation instructions. The processor also includes a functional unit to perform a computation based on a corresponding instruction.
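The idea of a reuse table can be modeled in software as a lookup keyed by instruction and operands. The sketch below is a behavioral model only: the class name `ReuseTable`, the key layout, and the least-recently-used eviction policy are assumptions that the abstract does not state.

```python
from collections import OrderedDict


class ReuseTable:
    """Software model of a table that holds results of earlier computations."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (opcode, operands) -> result
        self.hits = 0
        self.misses = 0

    def execute(self, opcode, operands, functional_unit):
        """Return a stored result if present, else compute it and store it."""
        key = (opcode, tuple(operands))
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        result = functional_unit(*operands)  # fall back to the functional unit
        self.entries[key] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # assumed LRU eviction
        return result


if __name__ == "__main__":
    table = ReuseTable(capacity=2)
    multiply = lambda a, b: a * b
    table.execute("mul", (3, 4), multiply)
    table.execute("mul", (3, 4), multiply)
    print(table.hits, table.misses)
```

The second call with the same opcode and operands is served from the table, so the functional unit is invoked only once.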
Abstract:
Cache miss rates for threads operating in a simultaneous multi-threading computer processing environment can be estimated. The single thread rates can be estimated by monitoring a shared directory for cache misses for a first thread. Memory access requests can be routed to metering cache directories associated with the particular thread. Single thread misses to the shared directory and single thread misses to the associated metering cache directory are monitored and a performance indication is determined by comparing the cache misses with the thread misses. The directory in the associated metering cache is rotated, and a second sharing performance indication is determined.
Abstract:
Embodiments relate to a prefetch threshold for cache restoration. An aspect includes determining, based on a task switch from an outgoing task to a current task in a processor, a prefetch threshold for a next task, the prefetch threshold corresponding to an expected runtime of the current task and an amount of time required to prefetch data for the next task. Another aspect includes starting prefetching for the next task while the current task is executing based on the prefetch threshold.
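One way to read the relationship between the two quantities is shown in the sketch below. The abstract says only that the threshold corresponds to the expected runtime and the prefetch time; treating it as their difference, clamped at zero, is an assumption, and both function names are hypothetical.

```python
def prefetch_threshold(expected_runtime, prefetch_time):
    """Assumed rule: begin prefetching `prefetch_time` before the current task ends.

    Both arguments use the same time unit (for example, cycles).
    """
    return max(0, expected_runtime - prefetch_time)


def should_start_prefetch(elapsed, threshold):
    """True once the current task has run at least as long as the threshold."""
    return elapsed >= threshold


if __name__ == "__main__":
    threshold = prefetch_threshold(expected_runtime=100, prefetch_time=30)
    print(threshold, should_start_prefetch(elapsed=70, threshold=threshold))
```

Under this reading, a next task whose prefetch takes longer than the current task's expected runtime gets a threshold of zero, meaning prefetching starts immediately after the task switch.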
Abstract:
Embodiments relate to thread-based cache content savings for task switching in a computer processor. An aspect includes determining a cache entry in a cache of the computer processor that is owned by a first thread, wherein the determination is made based on a hardware thread identifier (ID) of the first thread matching a hardware thread ID in the cache entry. Another aspect includes determining whether the determined cache entry is eligible for prefetching. Yet another aspect includes, based on determining that the determined cache entry is eligible for prefetching, setting a marker in the cache entry to active.
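The three aspects (match on thread ID, check eligibility, set the marker) can be sketched as a single pass over the cache. The `CacheEntry` fields, the function name, and the eligibility rule used in the example (the entry is valid) are assumptions; the abstract does not define what makes an entry eligible.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    address: int
    thread_id: int            # hardware thread ID recorded in the entry
    valid: bool = True
    marker_active: bool = False


def mark_entries_for_thread(cache, thread_id, is_eligible):
    """Set the marker on every eligible entry owned by `thread_id`.

    `is_eligible` is a caller-supplied predicate, since the eligibility
    criterion is not specified. Returns the number of entries marked.
    """
    marked = 0
    for entry in cache:
        if entry.thread_id == thread_id and is_eligible(entry):
            entry.marker_active = True
            marked += 1
    return marked


if __name__ == "__main__":
    cache = [
        CacheEntry(address=0x100, thread_id=1),
        CacheEntry(address=0x140, thread_id=2),
        CacheEntry(address=0x180, thread_id=1, valid=False),
    ]
    print(mark_entries_for_thread(cache, 1, lambda e: e.valid))
```

Entries owned by other threads, and entries that fail the eligibility check, keep their marker inactive.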
Abstract:
Systems and methods to manage memory on a spin transfer torque magnetoresistive random-access memory (STT-MRAM) are provided. A particular method may include determining a performance characteristic using relationship information that relates a bit error rate to at least one of a programming pulse width, a temperature, a history-based predictive performance parameter, a coding scheme, and a voltage level also associated with a memory. The performance characteristic is stored and used to manage a write operation associated with the memory.