Abstract:
A general-purpose graphics processing unit is described. The graphics processing unit includes a streaming multiprocessor having a single instruction, multiple thread (SIMT) architecture including hardware multithreading, wherein the streaming multiprocessor comprises a first processing block including a first processing core having a first floating-point data path and a second processing core having a first integer data path, the first integer data path independent of the first floating-point data path, wherein the first integer data path is to enable execution of a first instruction and the first floating-point data path is to enable execution of a second instruction, the first instruction to be executed concurrently with the second instruction; a second processing block including a third processing core having a second floating-point data path and a fourth processing core having a second integer data path, the second integer data path independent of the second floating-point data path, wherein the second integer data path is to enable execution of a third instruction and the second floating-point data path is to enable execution of a fourth instruction, the third instruction to be executed concurrently with the fourth instruction; and a memory coupled with the first processing block and the second processing block.
Abstract:
In an example, an apparatus comprises a compute engine comprising a high precision component and a low precision component; and logic, at least partially including hardware logic, to receive instructions in the compute engine; select at least one of the high precision component or the low precision component to execute the instructions; and apply a gate to at least one of the high precision component or the low precision component to execute the instructions. Other embodiments are also disclosed and claimed.
Abstract:
Generally, this disclosure provides systems, devices, methods and computer readable media for implementing function callback requests between a first processor (e.g., a GPU) and a second processor (e.g., a CPU). The system may include a shared virtual memory (SVM) coupled to the first and second processors, the SVM configured to store at least one double-ended queue (Deque). An execution unit (EU) of the first processor may be associated with a first of the Deques and configured to push the callback requests to that first Deque. A request handler thread executing on the second processor may be configured to: pop one of the callback requests from the first Deque; execute a function specified by the popped callback request; and generate a completion signal to the EU in response to completion of the function.
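The push/pop/signal flow described above can be sketched in ordinary Python. This is only an illustration under stated assumptions: `collections.deque` stands in for a Deque held in shared virtual memory, a Python thread stands in for the request handler on the second processor, and `threading.Event` stands in for the completion signal. The names `CallbackRequest`, `request_handler`, and `run_demo` are hypothetical and do not come from the disclosure.

```python
import threading
import time
from collections import deque


class CallbackRequest:
    """One callback request: the function to run plus a completion signal."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args
        self.result = None
        self.done = threading.Event()  # stand-in for the completion signal


def request_handler(dq, stop):
    """Handler thread: pop requests, execute them, signal completion."""
    # Keep running until a stop was requested and the deque has drained.
    while not (stop.is_set() and not dq):
        try:
            req = dq.popleft()          # pop one callback request
        except IndexError:
            time.sleep(0.001)           # deque empty; poll again shortly
            continue
        req.result = req.func(*req.args)
        req.done.set()                  # signal completion to the requester


def run_demo():
    dq = deque()                        # assumed stand-in for a Deque in SVM
    stop = threading.Event()
    handler = threading.Thread(target=request_handler, args=(dq, stop))
    handler.start()

    # Requester side (the "EU" in this sketch): push, then wait for signals.
    requests = [CallbackRequest(pow, 2, n) for n in range(4)]
    for req in requests:
        dq.append(req)
    for req in requests:
        req.done.wait(timeout=5)

    stop.set()
    handler.join()
    return [req.result for req in requests]
```

The sketch pushes at one end and pops at the other, which is one possible discipline for a double-ended queue; the abstract does not say which ends are used.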
Abstract:
A method and apparatus to facilitate shared pointers in a heterogeneous platform are described. In one embodiment of the invention, the heterogeneous or non-homogeneous platform includes, but is not limited to, a central processing core or unit, a graphics processing core or unit, a digital signal processor, an interface module, and any other form of processing core. The heterogeneous platform has logic to facilitate sharing of pointers to a location of a memory shared by the CPU and the GPU. By sharing pointers in the heterogeneous platform, the data or information sharing between different cores in the heterogeneous platform can be simplified.
Abstract:
Thread synchronization with lock inflation methods and apparatus for managed run-time environments are disclosed. An example method disclosed herein comprises determining a locking operation to perform on a lock corresponding to an object, performing an optimistically balanced synchronization of the lock if the locking operation is not unbalanced, and modifying a lock shape of the lock if the locking operation is unbalanced.
Abstract:
Thread synchronization methods and apparatus for managed run-time environments are disclosed. An example method disclosed herein comprises determining a set of locking operations to perform on a lock corresponding to an object, performing an initial locking operation comprising at least one of a balanced synchronization of the lock and an optimistically balanced synchronization of the lock if the initial locking operation is not unbalanced, and, if the initial locking operation is active and comprises the optimistically balanced synchronization, further comprising modifying a state of a pending optimistically balanced release corresponding to the optimistically balanced synchronization if a subsequent locking operation is unbalanced.
Abstract:
One embodiment provides for a processing unit comprising fetch and decode circuitry to fetch and decode a floating-point multiply-accumulate instruction; and execution circuitry to execute the floating-point multiply-accumulate instruction. The execution circuitry comprises mantissa multiplication circuitry, wherein the mantissa multiplication circuitry is shared with an integer datapath of the execution circuitry, wherein responsive to the floating-point multiply-accumulate instruction, the mantissa multiplication circuitry is to perform a multiplication operation with a mantissa value of each 16-bit floating-point data element of a first plurality of 16-bit floating-point data elements and a mantissa value of a corresponding 16-bit floating-point data element of a second plurality of 16-bit floating-point data elements to generate a corresponding plurality of mantissa results; exponent processing circuitry, responsive to the floating-point multiply-accumulate instruction, to perform an operation with an exponent value of each 16-bit floating-point data element of the first plurality of 16-bit floating-point data elements and an exponent value of each corresponding 16-bit floating-point data element of the second plurality of 16-bit floating-point data elements to generate a corresponding plurality of exponent results; circuitry to process the plurality of mantissa results and the plurality of exponent results to generate a corresponding plurality of floating-point products; and adder circuitry to generate a plurality of result floating-point values, each result floating-point value comprising a sum of one or more floating-point products of the plurality of floating-point products and a corresponding accumulated floating-point value of a plurality of accumulated floating-point values.
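The split between mantissa multiplication, exponent processing, and accumulation can be illustrated in software. The following is a minimal sketch, assuming IEEE 754 binary16 encoding for the 16-bit elements, finite inputs only (infinities and NaNs are not handled), and accumulation in a Python float rather than in any particular hardware format. The function names are hypothetical.

```python
import math
import struct


def fp16_bits(x):
    """Return the 16-bit IEEE 754 binary16 encoding of x as an integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def decode(bits):
    """Split binary16 bits into (sign, exponent, integer mantissa).

    The encoded value equals (-1)**sign * mantissa * 2**exponent.
    Assumption: the input is finite (exponent field != 31).
    """
    sign = bits >> 15
    exp_field = (bits >> 10) & 0x1F
    frac = bits & 0x3FF
    if exp_field == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at -14.
        return sign, -14 - 10, frac
    # Normal number: restore the implicit leading 1 of the significand.
    return sign, exp_field - 15 - 10, frac | 0x400


def fp16_product(a_bits, b_bits):
    """Multiply two binary16 values via separate mantissa/exponent steps."""
    sa, ea, ma = decode(a_bits)
    sb, eb, mb = decode(b_bits)
    mantissa_result = ma * mb      # integer multiply, as a shared multiplier could do
    exponent_result = ea + eb      # exponent processing
    sign = -1.0 if sa ^ sb else 1.0
    return sign * math.ldexp(mantissa_result, exponent_result)


def multiply_accumulate(a, b, acc):
    """Return acc plus the sum of element-wise products of a and b."""
    total = acc
    for x, y in zip(a, b):
        total += fp16_product(fp16_bits(x), fp16_bits(y))
    return total
```

For example, `multiply_accumulate([1.5, 2.0], [2.0, 0.5], 1.0)` forms the products 3.0 and 1.0 and adds them to the accumulated value 1.0. Rounding of the result back to a 16-bit format is left out of this sketch.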
Abstract:
One embodiment provides for a machine-learning hardware accelerator comprising a compute unit having an adder and a multiplier that are shared between an integer datapath and a floating-point datapath, the upper bits of input operands to the multiplier to be gated during floating-point operation.
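One way to picture operand gating on a shared multiplier is as a mask applied to the inputs. The sketch below is an assumption-laden model, not the disclosed circuit: it assumes a 32-bit integer operand width, an 11-bit significand (as in a 16-bit floating-point format with an implicit leading bit), and that "gating" means forcing the upper operand bits to zero. `MANT_BITS`, `INT_BITS`, and `shared_multiply` are hypothetical names.

```python
MANT_BITS = 11   # assumed significand width, including the implicit bit
INT_BITS = 32    # assumed integer operand width


def shared_multiply(a, b, float_mode):
    """Model one multiplier serving both integer and floating-point paths.

    In floating-point mode only the low MANT_BITS of each operand carry a
    significand, so the upper bits are gated (forced to zero here).
    """
    width = MANT_BITS if float_mode else INT_BITS
    mask = (1 << width) - 1
    return (a & mask) * (b & mask)
```

In this model, stale data in the upper operand bits cannot affect a floating-point result, and in hardware those multiplier inputs would not toggle; the abstract itself does not state the purpose of the gating.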