Abstract:
A mechanism is described for facilitating fast data operations for machine learning at autonomous machines. A method of embodiments, as described herein, includes detecting input data to be used in computational tasks by a computation component of a compute pipeline of a processor including a graphics processor. The method may further include determining one or more frequent data values (FDVs) from the data, and pushing the one or more FDVs forward to bypass the computational tasks.
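As a software analogy of the frequent-data-value bypass this abstract describes, the sketch below counts value frequencies in the input, precomputes the result for each frequent value once, and routes matching elements around the multiplier. The function names and the single-weight multiply are illustrative assumptions, not the patented design.

```python
from collections import Counter

def find_frequent_values(data, top_k=1):
    """Return the top_k most frequent values (FDVs) in the input data."""
    counts = Counter(data)
    return [value for value, _ in counts.most_common(top_k)]

def multiply_with_bypass(data, weight, fdvs):
    """Multiply each element by weight, but bypass the compute path for
    frequent data values, whose products are precomputed once."""
    precomputed = {v: v * weight for v in fdvs}  # one multiply per FDV
    out = []
    for x in data:
        if x in precomputed:
            out.append(precomputed[x])  # bypass: reuse the cached product
        else:
            out.append(x * weight)      # full multiply
    return out
```

In real workloads the dominant FDV is typically zero (sparse activations), where the bypass also skips the multiply entirely.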
Abstract:
The present disclosure provides an apparatus comprising an interconnect fabric comprising one or more switches, a memory interface coupled to the interconnect fabric, an input/output (I/O) interface coupled to the interconnect fabric, and an array of processing clusters coupled to the interconnect fabric, the array of processing clusters to process instructions at variable precisions. At least one processing cluster comprises a plurality of registers to store source operands at variable precisions and an execution unit comprising a plurality of arithmetic logic units (ALUs) to execute one or more of the instructions to perform a mixed-precision fused multiply-accumulate (FMAC) operation of D = A ∗ B + C. Each source operand A, B, and C may be any of FP64, FP32, FP16, INT32, INT16, INT8, or INT4. An ALU is to generate the result operand D by multiplying source operand A with source operand B to generate an intermediate product, and adding the intermediate product to source operand C.
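The mixed-precision FMAC D = A ∗ B + C can be modeled in software by round-tripping operands through the relevant floating-point formats. The sketch below assumes FP16 sources A and B with an FP32 addend C and an FP32 result, one of the many operand combinations the abstract permits; the helper names are illustrative.

```python
import struct

def to_fp16(x):
    """Round a Python float to FP16 precision via a binary16 round-trip."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x):
    """Round a Python float to FP32 precision via a binary32 round-trip."""
    return struct.unpack('f', struct.pack('f', x))[0]

def mixed_precision_fma(a, b, c):
    """Sketch of the mixed-precision FMAC D = A * B + C: FP16 sources
    are multiplied into a wider intermediate product, which is then
    added to the FP32 addend at accumulator precision."""
    product = to_fp16(a) * to_fp16(b)   # intermediate product kept wider than FP16
    return to_fp32(product + to_fp32(c))
```

Keeping the intermediate product and accumulator wider than the sources is what lets such units trade input precision for throughput without losing accuracy in the accumulation.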
Abstract:
In an example, an apparatus comprises a plurality of execution units comprising at least a first type of execution unit and a second type of execution unit, and logic, at least partially including hardware logic, to analyze a workload and assign the workload to one of the first type of execution unit or the second type of execution unit. Other embodiments are also disclosed and claimed.
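A minimal sketch of the analyze-and-assign logic, assuming a simple FLOP-count heuristic and illustrative execution-unit names (neither the heuristic nor the names appear in the abstract):

```python
def assign_workload(workload, threshold_flops=1e6):
    """Toy scheduler: route compute-heavy workloads to a 'wide' execution
    unit type and light ones to a 'narrow' type. The threshold and the
    unit names are illustrative assumptions."""
    if workload["flops"] >= threshold_flops:
        return "wide_eu"    # first type: high-throughput unit
    return "narrow_eu"      # second type: low-power unit
```

A hardware implementation would make this decision from workload metadata rather than a literal FLOP count, but the dispatch structure is the same.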
Abstract:
In an example, an apparatus comprises a compute engine comprising a high precision component and a low precision component; and logic, at least partially including hardware logic, to receive instructions in the compute engine; select at least one of the high precision component or the low precision component to execute the instructions; and apply a gate to at least one of the high precision component or the low precision component to execute the instructions. Other embodiments are also disclosed and claimed.
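The select-and-gate behavior can be sketched as follows; the class, the gating flags, and the toy "low precision" rounding are all illustrative assumptions rather than the claimed design.

```python
class ComputeEngine:
    """Toy model of a compute engine with high- and low-precision
    components, where the component not selected for an instruction
    is gated (disabled to save power)."""

    def __init__(self):
        self.gated = {"high": False, "low": False}

    def execute(self, values, precision):
        # Gate the component that is not selected for this instruction
        other = "low" if precision == "high" else "high"
        self.gated[other] = True
        self.gated[precision] = False
        if precision == "high":
            return [float(v) for v in values]        # full precision
        return [round(float(v), 1) for v in values]  # reduced precision (toy)
```

Gating the unused component is the point of the split: low-precision instructions need not pay the power cost of the high-precision datapath.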
Abstract:
A semiconductor chip is described that includes an instruction execution unit having a functional unit, said functional unit having minimum and maximum comparison circuitry followed by interleaving circuitry, said minimum and maximum comparison circuitry to respectively identify minimums and maximums of same positioned elements from two different sets of sorted elements, said interleaving circuitry to interleave said minimums and maximums to help form a third sorted set composed of elements from said different sets and being larger than each of said different sets.
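The min/max-comparison-followed-by-interleaving structure resembles classical merging networks. Batcher's odd-even merge, sketched below, is one such network built from exactly these primitives; it is offered as an illustration of the idea, not as the patented circuit.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists with Batcher's odd-even merge: recursively
    merge the even- and odd-indexed subsequences, interleave the partial
    results, then apply one stage of adjacent min/max compare-exchanges."""
    if not a:
        return list(b)
    if not b:
        return list(a)
    if len(a) == 1 and len(b) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the even- and odd-indexed subsequences
    evens = odd_even_merge(a[::2], b[::2])
    odds = odd_even_merge(a[1::2], b[1::2])
    # Interleave the two partial results
    out = []
    for e, o in zip(evens, odds):
        out.extend([e, o])
    out.extend(evens[len(odds):])  # evens is never shorter than odds
    # One stage of adjacent compare-exchanges finishes the merge
    for i in range(1, len(out) - 1, 2):
        if out[i] > out[i + 1]:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out
```

Because every comparison in a fixed stage is independent, a hardware functional unit can perform all the min/max comparisons of that stage in parallel, which is what makes this family of networks attractive for an instruction execution unit.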
Abstract:
One embodiment provides for a compute apparatus to perform machine learning operations, the compute apparatus comprising a decode unit to decode a single instruction into a decoded instruction, the decoded instruction to cause the compute apparatus to perform a complex machine learning compute operation.
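As a loose illustration of a single decoded instruction triggering a complex machine-learning compute operation, the sketch below models a fused dot-product-accumulate; the choice of operation and all names are assumptions, not taken from the abstract.

```python
def dot_product_accumulate(a, b, acc):
    """Example of a complex ML compute operation that one decoded
    instruction might trigger: accumulate the dot product of two
    vectors into a running accumulator (illustrative only)."""
    for x, y in zip(a, b):
        acc += x * y    # fused multiply-add per element pair
    return acc
```

Exposing such a fused operation as one instruction lets the hardware schedule the whole multiply-add chain internally instead of decoding each primitive step separately.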