WAVE LEVEL MATRIX MULTIPLY INSTRUCTIONS
    Invention Publication

    Publication Number: US20240329998A1

    Publication Date: 2024-10-03

    Application Number: US18619392

    Application Date: 2024-03-28

    CPC classification number: G06F9/3802 G06F9/3001 G06F9/30098 G06F9/3867

    Abstract: An apparatus and method for efficiently processing multiply-and-accumulate operations on matrices. In various implementations, a computing system includes a parallel data processing circuit and a memory. The memory stores the instructions (or translated commands) of a parallel data application. The parallel data processing circuit performs a matrix multiplication operation using source operands accessed only once from a vector register file, together with multiple instantiations of a vector processing circuit capable of performing matrix multiplication operations corresponding to multiple different types of instructions. The multiplier circuit and the adder circuit of the vector processing circuit perform both the fused multiply-add (FMA) operation and the dot product (inner product) operation, rather than relying on independent, dedicated execution pipelines with one pipeline for the FMA operation and a separate pipeline for the dot product operation.
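
    The shared-pipeline idea in this abstract can be illustrated in software: the minimal C++ sketch below (a loose analogy, not the patented circuit; all names are invented) routes both an FMA instruction and a 4-element dot-product instruction through the same multiply-add datapath, with operands read once and then reused across iterations.

        #include <array>
        #include <cstdio>

        // The one shared multiplier-plus-adder datapath.
        static float shared_mad(float a, float b, float c) { return a * b + c; }

        // FMA instruction: a single pass through the shared datapath.
        static float exec_fma(float a, float b, float c) { return shared_mad(a, b, c); }

        // Dot-product instruction: four passes through the same datapath,
        // accumulating into c; no separate dedicated pipeline is required.
        static float exec_dot4(const std::array<float, 4>& a,
                               const std::array<float, 4>& b, float c) {
            float acc = c; // operands a and b were fetched from the register file once
            for (int i = 0; i < 4; ++i) acc = shared_mad(a[i], b[i], acc);
            return acc;
        }

        int main() {
            std::array<float, 4> a{1, 2, 3, 4}, b{5, 6, 7, 8};
            std::printf("fma  = %g\n", exec_fma(2, 3, 1));   // prints 7
            std::printf("dot4 = %g\n", exec_dot4(a, b, 0));  // prints 70
        }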

    Processing unit with small footprint arithmetic logic unit

    Publication Number: US11720328B2

    Publication Date: 2023-08-08

    Application Number: US17029836

    Application Date: 2020-09-23

    CPC classification number: G06F7/57 G06F17/16 G06N3/08

    Abstract: A parallel processing unit employs an arithmetic logic unit (ALU) having a relatively small footprint, thereby reducing the overall power consumption and circuit area of the processing unit. To support the smaller footprint, the ALU includes multiple stages to execute operations corresponding to a received instruction. The ALU executes at least one operation at a precision indicated by the received instruction, and then reduces the resulting data of the at least one operation to a smaller size before providing the results to another stage of the ALU to continue execution of the instruction.
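
    As a rough software analogy of the staged-narrowing idea (a hedged sketch, not the patented design; the stage split and the types involved are assumptions), the C++ below computes one stage at the instruction's indicated precision and narrows the intermediate result before the next stage consumes it, mirroring how a reduced-width later stage shrinks the datapath:

        #include <cstdio>

        // Stage 1: compute at the full precision the instruction indicates.
        static double stage1_multiply(double a, double b) { return a * b; }

        // Between stages: reduce the intermediate to a smaller size, so the
        // following stage only needs a narrower datapath.
        static float reduce_width(double x) { return static_cast<float>(x); }

        // Stage 2: continue execution of the instruction at the reduced width.
        static float stage2_accumulate(float partial, float addend) {
            return partial + addend;
        }

        int main() {
            float r = stage2_accumulate(reduce_width(stage1_multiply(1.5, 2.25)), 0.5f);
            std::printf("result = %g\n", r); // prints 3.875
        }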

    Dual vector arithmetic logic unit

    Publication Number: US11675568B2

    Publication Date: 2023-06-13

    Application Number: US17121354

    Application Date: 2020-12-14

    CPC classification number: G06F7/57 G06F9/3867 G06F17/16 G06T1/20 G06F15/8015

    Abstract: A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands collected from vector general purpose register (VGPR) banks at a cache, and output the results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.
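
    The operand-cache arrangement can be caricatured in software. In the hypothetical C++ model below (bank count, sizes, and the one-read-per-bank-per-step rule are all assumptions for illustration), operands are first gathered from per-bank register files into a small cache, after which two ALU "pipelines" consume cached values in the same step without additional register-file reads:

        #include <array>
        #include <cstdio>
        #include <vector>

        struct VgprBanks {
            std::array<std::vector<float>, 4> bank; // four banks of registers
            float read(int b, int idx) const { return bank[b][idx]; }
        };

        int main() {
            VgprBanks vgpr;
            for (auto& b : vgpr.bank) b = {1.0f, 2.0f, 3.0f, 4.0f};

            // Gather phase: one read per bank fills the operand cache.
            std::array<float, 4> cache;
            for (int b = 0; b < 4; ++b) cache[b] = vgpr.read(b, 1);

            // Issue phase: both ALU pipelines use cached operands in one step.
            float alu0 = cache[0] + cache[1]; // pipeline 0: ADD
            float alu1 = cache[2] * cache[3]; // pipeline 1: MUL
            std::printf("alu0 = %g, alu1 = %g\n", alu0, alu1); // prints 4 and 4
        }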

    Arithmetic logic unit register sequencing

    Publication Number: US11237827B2

    Publication Date: 2022-02-01

    Application Number: US16696108

    Application Date: 2019-11-26

    Abstract: A graphics processing unit (GPU) sequences provision of operands to a set of operand registers, thereby allowing the GPU to share at least one of the operand registers between processing threads. The GPU includes a plurality of arithmetic logic units (ALUs), with at least one of the ALUs configured to perform double precision operations. The GPU further includes a set of operand registers configured to store single precision operands. For a plurality of executing threads that request double precision operations, the GPU stores the corresponding operands at the operand registers. Over a plurality of execution cycles, the GPU sequences the transfer of operands from the set of operand registers to a designated double precision operand register. During each execution cycle, the double-precision ALU executes a double precision operation using the operand stored at the double precision operand register.
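
    A software caricature of the sequencing loop follows (assuming C++20 for std::bit_cast; the register layout is invented): each "cycle", the two 32-bit halves of one thread's operand are moved from single-precision-wide registers into a designated double-precision operand register, which the double-precision ALU then consumes.

        #include <bit>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // Reassemble a double from the two 32-bit operand-register halves.
        static double assemble(uint32_t lo, uint32_t hi) {
            return std::bit_cast<double>((uint64_t(hi) << 32) | lo);
        }

        int main() {
            std::vector<double> thread_inputs = {1.5, 2.5, 3.5};
            for (size_t cycle = 0; cycle < thread_inputs.size(); ++cycle) {
                // Single-precision-wide operand registers hold the two halves.
                uint64_t bits = std::bit_cast<uint64_t>(thread_inputs[cycle]);
                uint32_t lo = uint32_t(bits), hi = uint32_t(bits >> 32);
                // The sequencer moves them into the designated DP operand register.
                double dp_operand = assemble(lo, hi);
                // The DP ALU executes one double precision operation this cycle.
                std::printf("cycle %zu: %g -> %g\n", cycle, dp_operand, dp_operand * 2.0);
            }
        }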

    PACKED 16 BITS INSTRUCTION PIPELINE
    Invention Application

    Publication Number: US20190129718A1

    Publication Date: 2019-05-02

    Application Number: US15799560

    Application Date: 2017-10-31

    Abstract: Systems, apparatuses, and methods for executing packed math instructions are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, where N is less than M. At least two source operands of M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits is selected from either the high portion or the low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating whether the high portion or the low portion of a given source operand, associated with a register identifier specified elsewhere in the instruction, is used.
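
    The high/low selection described here maps naturally onto bit manipulation. The C++ sketch below (the opsel naming and encoding are invented for illustration, not the patented instruction format) packs two 16-bit values into a 32-bit register and lets per-operand instruction bits pick which half of each source feeds the operation:

        #include <cstdint>
        #include <cstdio>

        // Select the high or low 16-bit half of a 32-bit source register.
        static uint16_t select_half(uint32_t reg, bool high) {
            return high ? uint16_t(reg >> 16) : uint16_t(reg & 0xFFFFu);
        }

        int main() {
            uint32_t v0 = 0x00030005u; // high half = 3, low half = 5
            uint32_t v1 = 0x00070002u; // high half = 7, low half = 2

            bool opsel_src0 = false;   // instruction bit: take low half of src0
            bool opsel_src1 = true;    // instruction bit: take high half of src1

            uint16_t a = select_half(v0, opsel_src0); // 5
            uint16_t b = select_half(v1, opsel_src1); // 7
            std::printf("a + b = %u\n", unsigned(a + b)); // prints 12
        }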

    CONVOLUTIONAL NEURAL NETWORK OPERATIONS

    Publication Number: US20230097279A1

    Publication Date: 2023-03-30

    Application Number: US17489734

    Application Date: 2021-09-29

    Abstract: Methods and systems are disclosed for executing operations on single-instruction-multiple-data (SIMD) units. The techniques disclosed perform a dot product operation on input data during one compute cycle, including convolving the input data to generate intermediate data and applying one or more transitional operations to the intermediate data to generate output data. In the aspects described, the input data is the input to a layer of a convolutional neural network and the generated output data is the output of that layer.
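
    A scalar C++ sketch of the described flow (the window size, weights, and the choice of bias-add and ReLU as the transitional operations are assumptions) shows one convolution tap computed as a dot product, with transitional operations turning the intermediate value into the layer output:

        #include <algorithm>
        #include <array>
        #include <cstdio>

        // The dot-product step: convolve one input window with the filter.
        static float dot3(const std::array<float, 3>& a, const std::array<float, 3>& b) {
            float acc = 0.0f;
            for (int i = 0; i < 3; ++i) acc += a[i] * b[i];
            return acc;
        }

        int main() {
            std::array<float, 3> window{0.5f, -1.0f, 2.0f}; // slice of the layer input
            std::array<float, 3> kernel{1.0f, 1.0f, 1.0f};  // filter weights
            float intermediate = dot3(window, kernel);      // intermediate data
            // Transitional operations: bias add followed by ReLU.
            float output = std::max(0.0f, intermediate + 0.1f);
            std::printf("output = %g\n", output); // prints 1.6
        }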

    Pairing SIMD lanes to perform double precision operations

    Publication Number: US11409536B2

    Publication Date: 2022-08-09

    Application Number: US15342809

    Application Date: 2016-11-03

    Abstract: A method and apparatus for performing a multi-precision computation in a plurality of arithmetic logic units (ALUs) includes pairing a first Single Instruction/Multiple Data (SIMD) block channel device with a second SIMD block channel device to create a first block pair having one-level staggering between the first and second channel devices. A third SIMD block channel device is paired with a fourth SIMD block channel device to create a second block pair having one-level staggering between the third and fourth channel devices. A plurality of source inputs are received at the first block pair and the second block pair. The first block pair computes a first result, and the second block pair computes a second result.
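
    The pairing-with-staggering idea can be mimicked in software. In the hedged C++ sketch below (using a 64-bit integer add as a stand-in for a generic multi-precision operation; the step labels are illustrative), one 32-bit "lane" produces the low half of the result, and its paired lane, one step behind, consumes the carry to produce the high half:

        #include <cstdint>
        #include <cstdio>

        // Two paired 32-bit lanes jointly compute one 64-bit addition.
        static uint64_t paired_add64(uint64_t x, uint64_t y) {
            uint32_t xlo = uint32_t(x), xhi = uint32_t(x >> 32);
            uint32_t ylo = uint32_t(y), yhi = uint32_t(y >> 32);
            uint64_t lo = uint64_t(xlo) + ylo;          // lane 0, step 0
            uint32_t carry = uint32_t(lo >> 32);        // handoff to the paired lane
            uint64_t hi = uint64_t(xhi) + yhi + carry;  // lane 1, step 1 (staggered)
            return (hi << 32) | uint32_t(lo);
        }

        int main() {
            uint64_t r = paired_add64(0xFFFFFFFFULL, 1ULL); // carry crosses the pair
            std::printf("r = 0x%llx\n", (unsigned long long)r); // prints 0x100000000
        }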
