-
Publication number: US11768664B2
Publication date: 2023-09-26
Application number: US16591031
Application date: 2019-10-02
Applicant: ADVANCED MICRO DEVICES, INC.
Inventor: Bin He, Michael Mantor, Jiasheng Chen
CPC classification number: G06F7/57, G06F7/483, G06F7/5443, G06F9/3818, G06F2207/3824
Abstract: A graphics processing unit (GPU) implements operations, with associated op codes, to perform mixed precision mathematical operations. The GPU includes an arithmetic logic unit (ALU) with different execution paths, wherein each execution path executes a different mixed precision operation. By implementing mixed precision operations at the ALU in response to designated op codes that delineate the operations, the GPU efficiently increases the precision of specified mathematical operations while reducing execution overhead.
-
Publication number: US11762658B2
Publication date: 2023-09-19
Application number: US16581252
Application date: 2019-09-24
Applicant: ADVANCED MICRO DEVICES, INC.
Inventor: Bin He, Michael Mantor, Jiasheng Chen, Jian Huang
CPC classification number: G06F9/30036, G06F9/30101, G06F9/3877, G06F9/544, G06F17/16
Abstract: A processing unit such as a graphics processing unit (GPU) includes a plurality of vector signal processors (VSPs) that include multiply/accumulate elements. The processing unit also includes a plurality of registers associated with the plurality of VSPs. First portions of first and second matrices are fetched into the plurality of registers prior to a first round that includes a plurality of iterations. The multiply/accumulate elements perform matrix multiplication and accumulation on different combinations of subsets of the first portions of the first and second matrices in the plurality of iterations prior to fetching second portions of the first and second matrices into the plurality of registers for a second round. The accumulated results of multiplying the first portions of the first and second matrices are written into an output buffer in response to completing the plurality of iterations.
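The round-and-iteration scheme in this abstract can be illustrated in software. The following is only a minimal sketch, assuming that a "round" corresponds to one slice of the shared dimension and that plain Python lists stand in for the registers and output buffer; the function name `tiled_matmul` and the `tile` parameter are hypothetical and do not come from the patent.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), fetching one k-slice per round."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]  # stands in for the output buffer
    for k0 in range(0, k, tile):  # one round per fetched portion
        # "Fetch" portions of both matrices into local registers.
        a_regs = [row[k0:k0 + tile] for row in a]
        b_regs = b[k0:k0 + tile]
        # Iterations: each (row, column) combination reuses the fetched data.
        for i in range(n):
            for j in range(m):
                acc = 0
                for t in range(len(b_regs)):
                    acc += a_regs[i][t] * b_regs[t][j]
                out[i][j] += acc  # accumulate this round's partial product
    return out
```

In this sketch the fetched slices are reused by every iteration of a round before the next slice is loaded, which is the data-reuse idea the abstract describes; how the hardware actually partitions the matrices is not specified here.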
-
Publication number: US11630667B2
Publication date: 2023-04-18
Application number: US16697660
Application date: 2019-11-27
Applicant: ADVANCED MICRO DEVICES, INC.
Inventor: Jiasheng Chen, Bin He, Jian Huang, Michael Mantor
Abstract: A processor includes a plurality of vector sub-processors (VSPs) and a plurality of memory banks dedicated to respective VSPs. A first memory bank corresponding to a first VSP includes a first plurality of high vector general purpose register (VGPR) banks and a first plurality of low VGPR banks corresponding to the first plurality of high VGPR banks. The first memory bank further includes a plurality of operand gathering components that store operands from respective high VGPR banks and low VGPR banks. The operand gathering components are assigned to individual threads while the threads are executed by the first VSP.
-
Publication number: US11494192B2
Publication date: 2022-11-08
Application number: US16860842
Application date: 2020-04-28
Inventor: Jiasheng Chen, YunXiao Zou, Bin He, Angel E. Socarras, QingCheng Wang, Wei Yuan, Michael Mantor
Abstract: A processing element is implemented in a stage of a pipeline and configured to execute an instruction. A first array of multiplexers is to provide information associated with the instruction to the processing element in response to the instruction being in a first set of instructions. A second array of multiplexers is to provide information associated with the instruction to the first processing element in response to the instruction being in a second set of instructions. A control unit is to gate at least one of power or a clock signal provided to the first array of multiplexers in response to the instruction being in the second set.
-
Publication number: US10970081B2
Publication date: 2021-04-06
Application number: US15637629
Application date: 2017-06-29
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen, Bin He, Mohammad Reza Hakami, Timothy Lottes, Justin David Smith, Michael J. Mantor, Derek Carson
Abstract: Systems, apparatuses, and methods for implementing a decoupled crossbar for a stream processor are disclosed. In one embodiment, a system includes at least a multi-lane execution pipeline, a vector register file, and a crossbar. The system is configured to determine if a given instruction in an instruction stream requires a permutation on data operands retrieved from the vector register file. The system conveys the data operands to the multi-lane execution pipeline on a first path which includes the crossbar responsive to determining the given instruction requires a permutation on the data operands. The crossbar then performs the necessary permutation to route the data operands to the proper processing lanes. Otherwise, the system conveys the data operands to the multi-lane execution pipeline on a second path which bypasses the crossbar responsive to determining the given instruction does not require a permutation on the input operands.
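The two operand paths in this abstract can be modeled in a few lines. This is a sketch under stated assumptions, not the patented design: the function `route_operands` and its `permutation` argument are hypothetical, and a list index mapping stands in for the crossbar.

```python
def route_operands(operands, permutation=None):
    """Return the operands as seen by the execution lanes.

    permutation[lane] gives the source index feeding that lane.
    None means the instruction needs no permutation.
    """
    if permutation is None:
        # Second path: bypass the crossbar and pass operands straight through.
        return list(operands)
    if sorted(permutation) != list(range(len(operands))):
        raise ValueError("permutation must map every lane to a distinct source")
    # First path: the crossbar routes each source to its target lane.
    return [operands[src] for src in permutation]
```

For example, `route_operands([10, 20, 30, 40], [3, 2, 1, 0])` reverses the lane order, while calling it without a permutation leaves the operands untouched.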
-
Publication number: US10817302B2
Publication date: 2020-10-27
Application number: US15644045
Application date: 2017-07-07
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen, Bin He, Mark M. Leather, Michael J. Mantor, Yunxiao Zou
IPC: G06F9/38, G06F9/30, G06F12/0891, G06F12/0855, G06F12/0804, G06F12/121, G06F12/0875
Abstract: Systems, apparatuses, and methods for implementing a high bandwidth, low power vector register file for use by a parallel processor are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of processing pipelines. The parallel processing unit includes a vector arithmetic logic unit and a high bandwidth, low power, vector register file. The vector register file includes multi-bank high density random-access memories (RAMs) to satisfy register bandwidth requirements. The parallel processing unit also includes an instruction request queue and an instruction operand buffer to provide enough local bandwidth for VALU instructions and vector I/O instructions. Also, the parallel processing unit is configured to leverage the RAM's output flops as a last level cache to reduce duplicate operand requests between multiple instructions. The parallel processing unit includes a vector destination cache to provide additional R/W bandwidth for the vector register file.
-
Publication number: US10656951B2
Publication date: 2020-05-19
Application number: US15789318
Application date: 2017-10-20
Inventor: Jiasheng Chen, YunXiao Zou, Bin He, Angel E. Socarras, QingCheng Wang, Wei Yuan, Michael Mantor
Abstract: A processing element is implemented in a stage of a pipeline and configured to execute an instruction. A first array of multiplexers is to provide information associated with the instruction to the processing element in response to the instruction being in a first set of instructions. A second array of multiplexers is to provide information associated with the instruction to the first processing element in response to the instruction being in a second set of instructions. A control unit is to gate at least one of power or a clock signal provided to the first array of multiplexers in response to the instruction being in the second set.
-
Publication number: US20190004814A1
Publication date: 2019-01-03
Application number: US15637629
Application date: 2017-06-29
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen, Bin He, Mohammad Reza Hakami, Timothy Lottes, Justin David Smith, Michael J. Mantor, Derek Carson
Abstract: Systems, apparatuses, and methods for implementing a decoupled crossbar for a stream processor are disclosed. In one embodiment, a system includes at least a multi-lane execution pipeline, a vector register file, and a crossbar. The system is configured to determine if a given instruction in an instruction stream requires a permutation on data operands retrieved from the vector register file. The system conveys the data operands to the multi-lane execution pipeline on a first path which includes the crossbar responsive to determining the given instruction requires a permutation on the data operands. The crossbar then performs the necessary permutation to route the data operands to the proper processing lanes. Otherwise, the system conveys the data operands to the multi-lane execution pipeline on a second path which bypasses the crossbar responsive to determining the given instruction does not require a permutation on the input operands.
-
Publication number: US20180113709A1
Publication date: 2018-04-26
Application number: US15342809
Application date: 2016-11-03
Applicant: Advanced Micro Devices, Inc.
Inventor: Bin He, YunXiao Zou, Jiasheng Chen, Michael Mantor
CPC classification number: G06F9/3887, G06F9/30014, G06F9/30036, G06F9/3893
Abstract: A method and apparatus for performing a multi-precision computation in a plurality of arithmetic logic units (ALUs) includes pairing a first Single Instruction/Multiple Data (SIMD) block channel device with a second SIMD block channel device to create a first block pair having one-level staggering between the first and second channel devices. A third SIMD block channel device is paired with a fourth SIMD block channel device to create a second block pair having one-level staggering between the third and fourth channel devices. A plurality of source inputs are received at the first block pair and the second block pair. The first block pair computes a first result, and the second block pair computes a second result.
-
Publication number: US20250130774A1
Publication date: 2025-04-24
Application number: US18395190
Application date: 2023-12-22
Applicant: Advanced Micro Devices, Inc.
Inventor: Shubh Shah, Ashutosh Garg, Bin He, Michael Mantor, Shubra Marwaha, Subramaniam Maiyuran
Abstract: The disclosed circuit can interpret a bit sequence as a value based on one of multiple floating point number formats in a bias mode indicated by a bias mode indicator. The circuit can perform an operation using the value in the bias mode. Various other methods, systems, and computer-readable media are also disclosed.
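To make the idea of a bias mode concrete, the sketch below decodes the same bit pattern under different exponent biases. It is an illustration only: the function `decode` and its parameters are hypothetical, the field layout (sign, exponent, mantissa) is an assumption, and special values such as infinities and NaNs are deliberately ignored.

```python
def decode(bits, exp_bits, man_bits, bias):
    """Interpret an integer bit pattern as a small floating point value.

    The layout is assumed to be sign | exponent | mantissa, and the
    bias is supplied by the caller (standing in for a bias mode).
    """
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * man * 2.0 ** (1 - bias - man_bits)
    # Normal: implicit leading one plus the mantissa fraction.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

With a 4-bit exponent and 3-bit mantissa, the pattern `0b00111000` decodes to 1.0 under a bias of 7 and to 0.5 under a bias of 8, which shows how a bias mode indicator changes the value a single bit sequence represents.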