Abstract:
Methods and apparatus for vector-matrix comparison are disclosed. In one embodiment, a processor comprises decoding and execution circuitry. The decoding circuitry decodes an instruction, where operands of the instruction specify an output location to store output results, a vector of data element values, and a matrix of data element values. The execution circuitry executes the decoded instruction. The execution includes to map each of the data element values of the vector to one of consecutive rows of the matrix and, for each data element value of the vector, to compare that data element value with data element values in a respective row of the matrix to obtain data element match results. The execution further includes to store the output results based on the data element match results, where each output result maps to a respective data element column position and indicates a vector match result.
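To make the row mapping and per-column outputs concrete, the following is a minimal Python sketch of the described semantics. The function name is hypothetical, and combining per-row matches into a single per-column flag is an assumed reading; the abstract specifies only the row mapping, the per-element compares, and column-indexed output results.

    def vector_matrix_compare(vector, matrix):
        # Element i of the vector is compared against row i of the matrix.
        assert len(vector) == len(matrix)
        num_cols = len(matrix[0])
        output = [0] * num_cols
        for i, value in enumerate(vector):
            for j in range(num_cols):
                if matrix[i][j] == value:   # data element match result
                    output[j] = 1           # column j reports a match
        return output

    # Row 0 matches 3 at column 1; row 1 matches 5 at columns 0 and 2.
    print(vector_matrix_compare([3, 5], [[1, 3, 4], [5, 2, 5]]))   # [1, 1, 1]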
Abstract:
Methods and apparatus relating to techniques for fusing SIMD processing units are disclosed. In an example, an apparatus comprises logic, at least partially comprising hardware logic, to receive an instruction set for execution on at least two graphics processing execution units, determine whether the instruction set requires data dependent addressing, and select between a synchronized execution environment and an unsynchronized execution environment for the at least two graphics processing execution units based at least in part on the determination of whether the instruction set requires data dependent addressing. Other embodiments are also disclosed and claimed.
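The selection described above reduces to a predicate over the instruction set. Below is a hedged Python sketch; the instruction encoding and the policy that data-dependent addressing forces the unsynchronized environment are assumptions, since the abstract states only that the choice is based at least in part on that determination.

    def requires_data_dependent_addressing(insn):
        # Hypothetical check: addresses computed from register contents
        # (gather/scatter-style) rather than known statically.
        return insn.get("addressing") == "data_dependent"

    def select_execution_environment(instruction_set):
        # Fused execution units can run in lockstep unless some instruction
        # needs per-lane, data-dependent addresses (assumed policy).
        if any(requires_data_dependent_addressing(i) for i in instruction_set):
            return "unsynchronized"
        return "synchronized"

    program = [{"op": "mad"}, {"op": "load", "addressing": "data_dependent"}]
    print(select_execution_environment(program))   # unsynchronized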
Abstract:
Dynamic heterogeneous hashing function technology for balancing memory requests between multiple memory channels is described. A processor includes functional units and multiple memory channels, and a memory controller unit (MCU) coupled between them. The MCU includes a general-purpose hashing function block that defines a default interleaving sequence for memory requests to alternately access the multiple memory channels and multiple specific-purpose hashing function blocks that define different interleaving sequences for the memory requests to alternately access the multiple memory channels. The MCU also includes a hashing-function selection block. The hashing-function selection block is operable to select one of the specific-purpose hashing function blocks or the general-purpose hashing function block for a current memory request in view of a requesting functional unit originating the current memory request.
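A small Python model of the selection block is sketched below. The channel count, hash formulas, and the unit-to-hash mapping are illustrative assumptions; the abstract requires only a default general-purpose hash plus specific-purpose hashes chosen in view of the requesting functional unit.

    NUM_CHANNELS = 4

    def general_purpose_hash(addr):
        # Default interleaving sequence: alternate channels per 64B line.
        return (addr >> 6) % NUM_CHANNELS

    def display_unit_hash(addr):
        # Hypothetical specific-purpose hash: page-granular interleaving.
        return (addr >> 12) % NUM_CHANNELS

    SPECIFIC_HASHES = {"display": display_unit_hash}   # assumed mapping

    def select_channel(requesting_unit, addr):
        # Hashing-function selection in view of the request's originator.
        hash_fn = SPECIFIC_HASHES.get(requesting_unit, general_purpose_hash)
        return hash_fn(addr)

    print(select_channel("display", 0x5000))   # 1 (page-granular)
    print(select_channel("core", 0x5000))      # 0 (line-granular default)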
Abstract:
Embodiments detailed herein relate to reduction operations on a plurality of data element values. In one embodiment, a processor comprises decoding circuitry to decode an instruction and execution circuitry to execute the decoded instruction. The instruction specifies a first input register containing a plurality of data element values, a first index register containing a plurality of indices, and an output register, where each index of the plurality of indices maps to one unique data element position of the first input register. The execution includes to identify data element values that are associated with one another based on the indices, perform one or more reduction operations on the associated data element values based on the identification, and store results of the one or more reduction operations in the output register.
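As a worked example, the sketch below models the index-directed reduction in Python, assuming addition as the reduction operation and assuming results are stored at the positions named by the indices; the abstract permits other reduction operations and does not fix the output layout.

    def indexed_reduce(values, indices, op=lambda a, b: a + b):
        # Elements whose indices match are associated and reduced together.
        output = {}
        for value, idx in zip(values, indices):
            output[idx] = op(output[idx], value) if idx in output else value
        return output

    # Positions 0 and 2 share index 0; positions 1 and 3 share index 1.
    print(indexed_reduce([1, 2, 3, 4], [0, 1, 0, 1]))   # {0: 4, 1: 6}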
Abstract:
Embodiments described herein provide a graphics processor in which dependency tracking hardware is simplified via the use of compiler-provided software scoreboard information. In one embodiment, the shader compiler for shader programs is configured to encode software scoreboard information into each instruction. Dependencies can be evaluated by the shader compiler and provided as scoreboard information with each instruction. The hardware can then use the provided information when scheduling instructions. In one embodiment, a software scoreboard synchronization instruction is provided to facilitate software dependency handling within a shader program. Using software to handle dependency tracking and synchronization can simplify hardware design, reducing the area consumed by the hardware. In one embodiment, dependencies can be evaluated by the shader compiler instead of the GPU hardware. The compiler can then insert a software scoreboard sync immediate instruction into compiled program code to manage instruction dependencies and prevent data hazards from occurring.
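One way to picture the division of labor is the toy Python model below: the compiler attaches set/wait tokens to instructions, and issue logic only checks those tokens. The token format is an assumption for illustration; the abstract specifies only that the compiler encodes scoreboard information per instruction and emits a synchronization instruction to enforce dependencies.

    class SoftwareScoreboard:
        def __init__(self, slots=16):
            self.pending = [False] * slots   # in-flight results per slot

        def can_issue(self, insn):
            # A sync waits on compiler-named slots; no dynamic dependency
            # tracking hardware is consulted.
            return not any(self.pending[s] for s in insn.get("wait", []))

        def issue(self, insn):
            for s in insn.get("set", []):
                self.pending[s] = True       # result will land in slot s

        def retire(self, insn):
            for s in insn.get("set", []):
                self.pending[s] = False      # dependents may now issue

    sb = SoftwareScoreboard()
    load = {"op": "load", "set": [3]}
    sync = {"op": "sync", "wait": [3]}       # compiler-inserted sync
    sb.issue(load)
    print(sb.can_issue(sync))                # False until the load retires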
Abstract:
Embodiments described herein provide for an instruction and associated logic to enable a vector multiply-add instruction with automatic zero skipping for sparse input. One embodiment provides for a general-purpose graphics processor comprising logic to perform operations comprising fetching a hardware macro instruction having a predicate mask, a repeat count, and a set of initial operands, where the initial operands include a destination operand and multiple source operands. The hardware macro instruction is configured to perform one or more multiply/add operations on input data associated with a set of matrices.
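Behaviorally, the zero skipping amounts to eliding multiply/add work for masked lanes and zero-valued inputs. The Python sketch below assumes a particular operand layout (one destination row, per-repeat source rows); the predicate mask and repeat count come from the abstract, the rest is illustrative.

    def sparse_multiply_add(dst, src_a, src_b, predicate_mask, repeat_count):
        for r in range(repeat_count):
            for lane in range(len(dst)):
                if not (predicate_mask >> lane) & 1:
                    continue                 # lane disabled by predicate mask
                a = src_a[r][lane]
                if a == 0:
                    continue                 # automatic zero skipping
                dst[lane] += a * src_b[r][lane]
        return dst

    # Lane 1 is skipped because its input is zero; no multiply is issued.
    print(sparse_multiply_add([0, 0], [[2, 0]], [[3, 4]], 0b11, 1))   # [6, 0]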
Abstract:
A processor is disclosed. The processor includes an execution unit having a register file with one or more banks of registers to store operand values, an accumulator comprising a pool of registers to store operand values determined to cause a conflict at register banks within the register file, and cache circuitry to control storage of the operand values determined to cause a conflict at the register banks from the register file to the pool of registers.
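A toy Python model of that caching behavior is sketched below. The bank mapping and the pool policy (cache an operand after it collides once) are assumptions; the abstract states only that conflicting operand values are moved from the register file into the pool.

    NUM_BANKS = 4

    class OperandCache:
        def __init__(self):
            self.pool = {}   # accumulator: pool of registers for hot operands

        def read_operands(self, regs, register_file):
            banks_in_use, values = set(), []
            for reg in regs:
                if reg in self.pool:
                    values.append(self.pool[reg])     # no bank port consumed
                    continue
                bank = reg % NUM_BANKS                # assumed bank mapping
                if bank in banks_in_use:
                    self.pool[reg] = register_file[reg]   # cache the offender
                banks_in_use.add(bank)
                values.append(register_file[reg])
            return values

    rf = list(range(16))
    cache = OperandCache()
    cache.read_operands([0, 4, 8], rf)   # r4, r8 conflict on bank 0, get pooled
    print(sorted(cache.pool))            # [4, 8]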
Abstract:
Embodiments described herein provide for an instruction and associated logic to enable GPGPU program code to access special-purpose hardware logic to accelerate dot product operations. One embodiment provides for a graphics processing unit comprising a fetch unit to fetch an instruction for execution and a decode unit to decode the instruction into a decoded instruction. The decoded instruction is a matrix instruction to cause the graphics processing unit to perform a parallel dot product operation. The GPGPU also includes systolic dot product circuitry to execute the decoded instruction across one or more SIMD lanes using multiple systolic layers, where, to execute the decoded instruction, a dot product computed at a first systolic layer is to be output to a second systolic layer, and each systolic layer includes one or more sets of interconnected multipliers and adders, each set of multipliers and adders to generate a dot product.
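Functionally, each systolic layer forms a partial dot product and forwards its accumulation onward. The Python sketch below assumes the layers chain additively over successive row pairs; the abstract states only that a dot product computed at the first layer is output to the second.

    def systolic_layer(a_row, b_row, accumulator):
        # One set of interconnected multipliers and adders.
        partial = sum(x * y for x, y in zip(a_row, b_row))
        return accumulator + partial      # forwarded to the next layer

    def systolic_dot_product(a_rows, b_rows):
        acc = 0
        for a_row, b_row in zip(a_rows, b_rows):   # one row pair per layer
            acc = systolic_layer(a_row, b_row, acc)
        return acc

    print(systolic_dot_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # 70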