-
Publication No.: US12229570B2
Publication Date: 2025-02-18
Application No.: US17952270
Application Date: 2022-09-25
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, Michael John Mantor, Brian Emberling, Liang Huang, Chao Liu
Abstract: Block data load with transpose techniques are described. In one example, an input is received, at a control unit, specifying an instruction to load a block of data to at least one memory module using a transpose operation. Responsive to receiving the input by the control unit, the block of data is caused to be loaded to the at least one memory module by transposing the block of data to form a transposed block of data and storing the transposed block of data in the at least one memory module.
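As a rough software analogy (not the patented hardware path), a transposing block load simply stores the block with row and column indices swapped:

```python
def load_block_transposed(block):
    """Store a 2-D block in transposed order, mimicking a block load
    that transposes the data on its way to the memory module."""
    rows, cols = len(block), len(block[0])
    # Element (r, c) of the source lands at (c, r) in the stored copy.
    return [[block[r][c] for r in range(rows)] for c in range(cols)]
```

In hardware this happens during the load itself, so no separate transpose pass over memory is needed.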
-
Publication No.: US20240329998A1
Publication Date: 2024-10-03
Application No.: US18619392
Application Date: 2024-03-28
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, Michael J. Mantor, Brian D. Emberling
CPC classification number: G06F9/3802, G06F9/3001, G06F9/30098, G06F9/3867
Abstract: An apparatus and method for efficiently processing multiply-accumulate operations for matrices in applications. In various implementations, a computing system includes a parallel data processing circuit and a memory. The memory stores the instructions (or translated commands) of a parallel data application. The circuitry of the parallel data processing circuit performs a matrix multiplication operation using source operands accessed only once from a vector register file and multiple instantiations of a vector processing circuit capable of performing multiple matrix multiplication operations corresponding to multiple different types of instructions. The multiplier circuit and the adder circuit of the vector processing circuit perform both the fused multiply add (FMA) operation and the dot product (inner product) operation without independent, dedicated execution pipelines, rather than providing one execution pipeline for the FMA operation and a separate pipeline for the dot product operation.
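The pipeline-sharing idea can be sketched in software (an illustrative analogy, not the circuit itself): a dot product is just a chain of the same multiply-add primitive that serves the FMA instruction.

```python
def fma(a, b, c):
    """Fused multiply-add: a * b + c in a single step."""
    return a * b + c

def dot(xs, ys):
    """Inner product built by repeatedly reusing the FMA primitive,
    illustrating one multiply/add datapath serving both instruction types."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = fma(x, y, acc)
    return acc
```

Because each dot-product step is itself a multiply followed by an accumulate, a single multiplier/adder pair suffices for both instruction families.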
-
Publication No.: US11720328B2
Publication Date: 2023-08-08
Application No.: US17029836
Application Date: 2020-09-23
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, Shubh Shah, Michael Mantor
Abstract: A parallel processing unit employs an arithmetic logic unit (ALU) having a relatively small footprint, thereby reducing the overall power consumption and circuit area of the processing unit. To support the smaller footprint, the ALU includes multiple stages to execute operations corresponding to a received instruction. The ALU executes at least one operation at a precision indicated by the received instruction, and then reduces the resulting data of the at least one operation to a smaller size before providing the results to another stage of the ALU to continue execution of the instruction.
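A minimal sketch of the stage-narrowing idea, assuming a hypothetical 16-bit inter-stage datapath (the actual widths and staging are implementation details not given in the abstract):

```python
MASK16 = (1 << 16) - 1  # assumed narrow inter-stage width

def staged_mac(a, b, c):
    """Two-stage multiply-accumulate: the multiply runs at the precision
    the instruction requests, then the intermediate result is narrowed
    before the next ALU stage consumes it, modeling a smaller datapath."""
    product = a * b                 # stage 1: full-precision multiply
    narrowed = product & MASK16     # reduce the result size between stages
    return (narrowed + c) & MASK16  # stage 2: add at the narrow width
```

Narrowing between stages is what lets the later stages be physically smaller, at the cost of discarding bits the instruction's output precision does not need.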
-
Publication No.: US11675568B2
Publication Date: 2023-06-13
Application No.: US17121354
Application Date: 2020-12-14
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, Brian Emberling, Mark Leather, Michael Mantor
CPC classification number: G06F7/57, G06F9/3867, G06F17/16, G06T1/20, G06F15/8015
Abstract: A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general purpose register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.
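A toy model of the bandwidth argument (purely illustrative; register names and the one-read-per-miss policy are assumptions, not the patented design): operand requests that hit a small cache cost no VGPR bank reads, so more requests can be served per cycle at fixed bank bandwidth.

```python
def issue_with_cache(requests, cache):
    """Serve one cycle of operand requests: hits come from the operand
    cache for free, misses each consume one VGPR bank read."""
    vgpr_reads = 0
    for reg in requests:
        if reg not in cache:
            cache.add(reg)   # one VGPR bank read fills the cache entry
            vgpr_reads += 1
    return vgpr_reads
```

With the cache warm, two ALU pipelines can be fed in the same cycle while the VGPR banks supply only the operands not already cached.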
-
Publication No.: US11237827B2
Publication Date: 2022-02-01
Application No.: US16696108
Application Date: 2019-11-26
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, Jiasheng Chen, Jian Huang
Abstract: A graphics processing unit (GPU) sequences provision of operands to a set of operand registers, thereby allowing the GPU to share at least one of the operand registers between processing threads. The GPU includes a plurality of arithmetic logic units (ALUs) with at least one of the ALUs configured to perform double precision operations. The GPU further includes a set of operand registers configured to store single precision operands. For a plurality of executing threads that request double precision operations, the GPU stores the corresponding operands at the operand registers. Over a plurality of execution cycles, the GPU sequences transfer of operands from the set of operand registers to a designated double precision operand register. During each execution cycle, the double-precision ALU executes a double precision operation using the operand stored at the double precision operand register.
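The register-sequencing step can be illustrated with plain bit manipulation (a sketch, not the hardware transfer path): two 32-bit single-precision register slots, delivered on successive cycles, assemble one 64-bit double-precision operand.

```python
def combine_halves(hi32, lo32):
    """Assemble a 64-bit double-precision operand from two 32-bit
    single-precision register slots transferred on successive cycles."""
    return ((hi32 & 0xFFFFFFFF) << 32) | (lo32 & 0xFFFFFFFF)
```

Sequencing the halves into one designated wide register is what lets the narrower operand registers be time-shared rather than duplicated per thread.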
-
Publication No.: US10360177B2
Publication Date: 2019-07-23
Application No.: US15189054
Application Date: 2016-06-22
Applicant: Advanced Micro Devices, Inc., ATI Technologies ULC
Inventor: Syed Zohaib M. Gilani, Jiasheng Chen, QingCheng Wang, YunXiao Zou, Michael Mantor, Bin He, Timour T. Paltashev
IPC: G06F15/80, G06F1/3234, G06T15/00
Abstract: Described is a method and processing apparatus to improve power efficiency by gating redundant thread processing. In particular, the method for gating redundant threads in a graphics processor includes determining if data for a thread and data for at least another thread are within a predetermined similarity threshold, gating execution of the at least another thread if the data for the thread and the data for the at least another thread are within the predetermined similarity threshold, and using output data from the thread as the output data for the at least another thread.
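A minimal software sketch of the gating idea (the similarity metric here is a simple absolute difference, an assumption for illustration): threads whose input is within the threshold of an already-computed input reuse that result instead of executing.

```python
def gate_redundant(thread_data, threshold, compute):
    """Run `compute` only for thread inputs that differ from every
    already-computed input by more than `threshold`; gated threads
    reuse the earlier thread's output."""
    results, computed = [], []  # computed: (input, output) pairs
    for x in thread_data:
        reuse = next((out for inp, out in computed
                      if abs(x - inp) <= threshold), None)
        if reuse is None:       # no sufficiently similar prior thread
            reuse = compute(x)
            computed.append((x, reuse))
        results.append(reuse)
    return results
```

The power saving comes from the gated threads skipping execution entirely; the trade-off is a bounded error introduced by the similarity threshold.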
-
Publication No.: US20190129718A1
Publication Date: 2019-05-02
Application No.: US15799560
Application Date: 2017-10-31
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen, Bin He, Yunxiao Zou, Michael J. Mantor, Radhakrishna Giduthuri, Eric J. Finger, Brian D. Emberling
Abstract: Systems, apparatuses, and methods for routing traffic between clients and system memory are disclosed. A computing system includes a processor capable of executing single precision mathematical instructions on data sizes of M bits and half precision mathematical instructions on data sizes of N bits, which is less than M bits. At least two source operands with M bits indicated by a received instruction are read from a register file. If the instruction is a packed math instruction, at least a first source operand with a size of N bits less than M bits is selected from either a high portion or a low portion of one of the at least two source operands read from the register file. The instruction includes fields storing bits, each bit indicating the high portion or the low portion of a given source operand associated with a register identifier specified elsewhere in the instruction.
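With M = 32 and N = 16 as example widths (the abstract leaves M and N abstract), the per-operand selection bit reduces to picking one 16-bit half of a 32-bit register value:

```python
def select_half(operand32, take_high):
    """Pick the 16-bit half of a 32-bit source operand, as a packed-math
    instruction's per-operand selection bit would."""
    if take_high:
        return (operand32 >> 16) & 0xFFFF  # high portion
    return operand32 & 0xFFFF              # low portion
```

Encoding the high/low choice as one bit per operand lets a half-precision instruction address twice as many logical operands without widening the register identifiers.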
-
Publication No.: US20180121386A1
Publication Date: 2018-05-03
Application No.: US15354560
Application Date: 2016-11-17
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen, Angel E. Socarras, Michael Mantor, YunXiao Zou, Bin He
IPC: G06F15/80, G06F9/30, G06F12/0875, G06F12/0891
CPC classification number: G06F15/8007, G06F9/3001, G06F9/30105, G06F9/3012, G06F9/30123, G06F9/3828, G06F9/3851, G06F9/3887, G06F9/3891, G06F12/0875, G06F12/0891, G06F2212/604
Abstract: A super single instruction, multiple data (SIMD) computing structure and a method of executing instructions in the super-SIMD is disclosed. The super-SIMD structure is capable of executing more than one instruction from a single thread or multiple threads and includes a plurality of vector general purpose registers (VGPRs), a first arithmetic logic unit (ALU) coupled to the plurality of VGPRs, a second ALU coupled to the plurality of VGPRs, and a destination cache (Do$) that is coupled via bypass and forwarding logic to the first ALU and the second ALU and receives the output of the first ALU and the second ALU. The Do$ holds multiple instruction results to extend an operand bypass network, saving read and write transaction power. A compute unit (CU) and a small CU including a plurality of super-SIMDs are also disclosed.
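A toy model of the destination cache (capacity, eviction policy, and register naming are all assumptions for illustration): recent ALU results are forwarded from the Do$ on a hit, and only misses fall back to a VGPR read.

```python
class DestCache:
    """Tiny destination cache (Do$) sketch: recent ALU results are kept
    here so a following instruction can forward them instead of paying
    a VGPR file read."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # reg -> value; dicts preserve insertion order

    def write(self, reg, value):
        """An ALU writes its result into the Do$, evicting the oldest
        entry when the cache is full."""
        self.entries[reg] = value
        if len(self.entries) > self.capacity:
            self.entries.pop(next(iter(self.entries)))

    def read(self, reg, vgpr):
        """Forward from the Do$ on a hit; fall back to the VGPR file."""
        return self.entries.get(reg, vgpr.get(reg))
```

Each forwarded operand avoids both the VGPR write of the producer and the VGPR read of the consumer, which is where the claimed transaction-power saving comes from.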
-
Publication No.: US20230097279A1
Publication Date: 2023-03-30
Application No.: US17489734
Application Date: 2021-09-29
Applicant: Advanced Micro Devices, Inc.
Inventor: Brian Emberling, Michael Mantor, Michael Y. Chow, Bin He
Abstract: Methods and systems are disclosed for executing operations on single-instruction-multiple-data (SIMD) units. Techniques disclosed perform a dot product operation on input data during one compute cycle, including convolving the input data, generating intermediate data, and applying one or more transitional operations to the intermediate data to generate output data. Aspects are described wherein the input data is an input to a layer of a convolutional neural network and the generated output data is the output of the layer.
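The link between convolution and dot products can be shown with a minimal 1-D sketch (illustrative only; the patent targets SIMD hardware, and like most CNN "convolutions" this computes the unflipped cross-correlation): each output element is a dot product of the kernel with one window of the input.

```python
def conv1d_valid(signal, kernel):
    """One CNN-layer step: each output element is the dot product of
    the kernel with a sliding window of the input (no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]
```

Mapping each window's dot product onto a single-cycle SIMD dot-product operation is what makes this formulation attractive for convolutional layers.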
-
Publication No.: US11409536B2
Publication Date: 2022-08-09
Application No.: US15342809
Application Date: 2016-11-03
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, YunXiao Zou, Jiasheng Chen, Michael Mantor
Abstract: A method and apparatus for performing a multi-precision computation in a plurality of arithmetic logic units (ALUs) includes pairing a first Single Instruction/Multiple Data (SIMD) block channel device with a second SIMD block channel device to create a first block pair having one-level staggering between the first and second channel devices. A third SIMD block channel device is paired with a fourth SIMD block channel device to create a second block pair having one-level staggering between the third and fourth channel devices. A plurality of source inputs are received at the first block pair and the second block pair. The first block pair computes a first result, and the second block pair computes a second result.
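The flavor of multi-precision computation over paired narrow channels can be sketched with a 64-bit add built from two 32-bit lane adds (an analogy for the staggering idea, not the claimed circuit): the low lane's carry feeds the high lane one level later.

```python
def paired_add64(a, b):
    """Add two 64-bit values using a pair of 32-bit lane adds with
    one-level staggering: the low lane's carry-out feeds the high lane."""
    mask = 0xFFFFFFFF
    lo = (a & mask) + (b & mask)          # first channel: low 32 bits
    carry = lo >> 32                      # staggered hand-off
    hi = ((a >> 32) & mask) + ((b >> 32) & mask) + carry  # second channel
    return ((hi & mask) << 32) | (lo & mask)
```

Staggering the paired channels by one level is what lets the dependent high half start as soon as the low half's carry is known, rather than waiting a full pass.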
-