Abstract:
Embodiments are generally directed to compression in machine learning and deep learning processing. An embodiment of an apparatus for compression of untyped data includes a graphics processing unit (GPU) including a data compression pipeline, the data compression pipeline including a data port coupled with one or more shader cores, wherein the data port is to allow transfer of untyped data without format conversion, and a 3D compression/decompression unit to provide for compression of untyped data to be stored to a memory subsystem and decompression of untyped data from the memory subsystem.
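As a rough illustration of the data path this abstract describes, the sketch below treats the traffic as raw bytes end to end: the port hands untyped data straight to a compression unit with no intermediate format conversion. A trivial run-length codec stands in for the hardware block codec, and all names (`CompressedBlock`, `compress_untyped`, and so on) are hypothetical, not taken from the patent.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for data moved through the port: untyped bytes,
// never reinterpreted as any surface or texel format.
struct CompressedBlock {
    std::vector<uint8_t> payload;
    size_t original_size;  // needed to restore the block on read-back
};

// Stand-in for the compression unit: a trivial run-length encoder.
// Real hardware would apply a lossless block codec before the write
// to the memory subsystem.
CompressedBlock compress_untyped(const uint8_t* data, size_t size) {
    CompressedBlock out{{}, size};
    for (size_t i = 0; i < size;) {
        uint8_t value = data[i];
        size_t run = 1;
        while (i + run < size && data[i + run] == value && run < 255) ++run;
        out.payload.push_back(static_cast<uint8_t>(run));
        out.payload.push_back(value);
        i += run;
    }
    return out;
}

// Inverse path: decompress on the read from the memory subsystem.
std::vector<uint8_t> decompress_untyped(const CompressedBlock& block) {
    std::vector<uint8_t> out;
    out.reserve(block.original_size);
    for (size_t i = 0; i + 1 < block.payload.size(); i += 2)
        out.insert(out.end(), block.payload[i], block.payload[i + 1]);
    return out;
}
```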
Abstract:
Technologies for dynamic acceleration of general-purpose code include a computing device having a general-purpose processor core and one or more hardware accelerators. The computing device identifies an acceleration candidate in an application that is targeted to the processor core. The acceleration candidate may be a long-running computation of the application. The computing device translates the acceleration candidate into a translated executable targeted to the hardware accelerator. The computing device determines whether to offload execution of the acceleration candidate and, if so, executes the translated executable with the hardware accelerator. The computing device may translate the acceleration candidate into multiple translated executables, each targeted to a different hardware accelerator. The computing device may select among the translated executables in response to determining to offload execution. The hardware accelerators may include, for example, processor graphics, an image signal processor, or a field-programmable gate array. Other embodiments are described and claimed.
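A minimal sketch of the offload decision described above, assuming the candidate has already been translated once per target. `TranslatedExecutable`, `dispatch`, and the fixed speedup estimate are hypothetical stand-ins, not the patented mechanism; a real system would consult a cost model or runtime telemetry rather than a constant.

```cpp
#include <functional>
#include <string>
#include <vector>

// One acceleration candidate, translated for one target.
struct TranslatedExecutable {
    std::string target;            // e.g. "gpu", "isp", "fpga"
    std::function<void()> run;     // stand-in for the translated binary
    double estimated_speedup;      // relative to the processor core
};

// Select among the translated executables, or fall back to the
// general-purpose core when no accelerator is expected to win.
void dispatch(const std::vector<TranslatedExecutable>& candidates,
              const std::function<void()>& cpu_fallback) {
    const TranslatedExecutable* best = nullptr;
    for (const auto& c : candidates)
        if (!best || c.estimated_speedup > best->estimated_speedup)
            best = &c;
    if (best && best->estimated_speedup > 1.0) best->run();
    else cpu_fallback();
}

int main() {
    std::vector<TranslatedExecutable> xs = {
        {"gpu",  [] { /* run GPU-targeted executable */ },  6.0},
        {"fpga", [] { /* run FPGA-targeted executable */ }, 3.5},
    };
    dispatch(xs, [] { /* execute original code on the core */ });
}
```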
Abstract:
An inter-architecture compatibility apparatus of an aspect includes a control flow transfer reception module to receive a first call procedure operation, intended for a first architecture library module, from a first architecture code module. The first call procedure operation involves a first plurality of input parameters. An application binary interface (ABI) change module is coupled with the control flow transfer reception module. The ABI change module makes ABI changes to convert the first call procedure operation involving the first plurality of input parameters to a corresponding second call procedure operation involving a second plurality of input parameters. The second call procedure operation is compatible with a second architecture library module. A control flow transfer output module is coupled with the ABI change module. The control flow transfer output module provides the second call procedure operation to the second architecture library module.
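By way of illustration only, the shim below does in software what the abstract's modules do in concert: it receives a call as the source architecture made it, adjusts the parameters to what the target library expects, and forwards the call. The 32-bit-to-64-bit widening is an invented example of an ABI difference, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <iostream>

// Second-architecture library routine: note the wider parameter types
// than the first-architecture call below.
int64_t target_lib_add(int64_t a, int64_t b) { return a + b; }

// Hypothetical ABI shim: receives the first call procedure operation
// (two 32-bit input parameters), applies the ABI changes the target
// expects (widening to 64 bits), and transfers control onward.
int32_t abi_shim_add(int32_t a, int32_t b) {
    int64_t result = target_lib_add(static_cast<int64_t>(a),
                                    static_cast<int64_t>(b));
    return static_cast<int32_t>(result);  // narrow the return value
}

int main() {
    std::cout << abi_shim_add(2, 3) << '\n';  // prints 5
}
```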
Abstract:
The present disclosure is directed to systems and methods for decomposing systolic array circuitry to provide a plurality of N×N systolic sub-array circuits, apportioning a first tensor or array into a plurality of N×M first input arrays, and apportioning a second tensor or array into a plurality of M×N second input arrays. Systolic array control circuitry transfers corresponding ones of the first input arrays and second input arrays to a respective one of the plurality of N×N systolic sub-array circuits. As the elements included in the first input array and the elements included in the second input array are transferred to the systolic sub-array, the systolic sub-array performs one or more mathematical operations using the first and second input arrays. The systems and methods improve utilization of the systolic array circuitry, thereby reducing the number of clock cycles needed to perform a given number of calculations.
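The core computation each sub-array performs can be sketched in plain code. Below, one call to `matmul_tile` stands in for a single N×N systolic sub-array consuming an N×M tile of the first tensor and an M×N tile of the second; in hardware the elements stream through the array rather than sitting in memory, and the tile sizes and names here are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <iostream>

constexpr size_t N = 2, M = 4;  // illustrative tile dimensions

using TileA = std::array<std::array<int, M>, N>;  // N x M first input
using TileB = std::array<std::array<int, N>, M>;  // M x N second input
using TileC = std::array<std::array<int, N>, N>;  // N x N result

// One sub-array's work: multiply an N x M tile by an M x N tile,
// producing the N x N partial result that sub-array holds.
TileC matmul_tile(const TileA& a, const TileB& b) {
    TileC c{};
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < M; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

int main() {
    TileA a = {{{1, 2, 3, 4}, {5, 6, 7, 8}}};
    TileB b = {{{1, 0}, {0, 1}, {1, 0}, {0, 1}}};
    TileC c = matmul_tile(a, b);
    for (const auto& row : c) {
        for (int v : row) std::cout << v << ' ';
        std::cout << '\n';
    }
}
```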
Abstract:
Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program into multiple parallel threads are described. In some embodiments, the systems and apparatuses execute a method of original code decomposition and/or generated thread execution.
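The abstract gives no algorithmic detail, but the kind of transformation such a system automates can be shown by hand: a serial reduction decomposed into per-thread partial results that are combined afterwards. This is a generic illustration, not the patented decomposition method; `parallel_sum` and the chunking scheme are assumptions.

```cpp
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Decompose a serial reduction into per-thread partial sums, then
// combine the partials; each worker owns a disjoint chunk of the data.
long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0L);
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0L);
}

int main() {
    std::vector<int> data(1000, 1);
    std::cout << parallel_sum(data, 4) << '\n';  // prints 1000
}
```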