-
Publication number: US11615306B2
Publication date: 2023-03-28
Application number: US16894025
Application date: 2020-06-05
Applicant: Advanced Micro Devices, Inc.
Abstract: An electronic device includes a memory that stores input matrices A and B, a cache memory, and a processor. The processor generates a compiled representation that includes values for acquiring data from input matrix A when processing instances of input data through a neural network, the values including a base address in input matrix A for each thread from among a number of threads and relative offsets, the relative offsets being distances between elements of input matrix A to be processed by the threads. The processor then stores, in the cache memory, the compiled representation including the base address for each thread and the relative offsets.
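The base-address-plus-relative-offsets idea in this abstract can be pictured with a short sketch. Below is a minimal illustration, assuming a row-major matrix A in which each thread walks one row at a fixed stride; the element size, function names, and layout are assumptions for the example, not details taken from the patent.

```python
ELEM_SIZE = 4  # bytes per matrix element (assumed float32)

def compile_representation(a_base, cols, num_threads, elems_per_thread):
    """Precompute one base address per thread plus shared relative offsets.

    At run time each load address is a base plus an offset lookup, so the
    full index arithmetic is done once, ahead of execution.
    """
    row_bytes = cols * ELEM_SIZE
    # One base address per thread: here, thread t starts at row t of A.
    bases = [a_base + t * row_bytes for t in range(num_threads)]
    # Relative offsets shared by all threads: byte distances between the
    # consecutive elements each thread processes.
    offsets = [i * ELEM_SIZE for i in range(elems_per_thread)]
    return bases, offsets

def addresses_for_thread(bases, offsets, t):
    """Expand the compiled representation into concrete load addresses."""
    return [bases[t] + off for off in offsets]

bases, offsets = compile_representation(a_base=0x1000, cols=8,
                                        num_threads=4, elems_per_thread=8)
print([hex(a) for a in addresses_for_thread(bases, offsets, t=1)])
```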
-
Publication number: US20220129312A1
Publication date: 2022-04-28
Application number: US17571374
Application date: 2022-01-07
Applicant: Advanced Micro Devices, Inc.
IPC: G06F9/48, G06F9/54, G06F15/78, G06F12/084, G06F12/0815
Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when it is broadcast.
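The single broadcast request described here amounts to a grouping step over work units. The example below is a minimal sketch, assuming each work unit names the block of shared data it reads; the request format and field names are assumptions for the example, not the patent's actual encoding.

```python
from collections import defaultdict

def build_broadcast_requests(work_units):
    """work_units: list of (compute_unit_id, shared_block_id) pairs.

    Work units that read the same block are served by one read request
    carrying the full list of destination compute units, instead of one
    request per compute unit.
    """
    readers = defaultdict(set)
    for cu, block in work_units:
        readers[block].add(cu)
    # One request per distinct block; the compute-unit list tells the
    # fabric where to broadcast the returned data.
    return [{"block": block, "broadcast_to": sorted(cus)}
            for block, cus in readers.items()]

reqs = build_broadcast_requests([(0, "A0"), (1, "A0"), (2, "A0"), (3, "B1")])
print(reqs)  # one request for A0 broadcast to CUs 0-2, one for B1 to CU 3
```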
-
Publication number: US11609785B2
Publication date: 2023-03-21
Application number: US16729811
Application date: 2019-12-30
Applicant: Advanced Micro Devices, Inc.
Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when it is broadcast by the processor core.
-
Publication number: US11275612B2
Publication date: 2022-03-15
Application number: US16723016
Application date: 2019-12-20
Applicant: Advanced Micro Devices, Inc.
IPC: G06F9/48, G06F9/54, G06F15/78, G06F12/084, G06F12/0815
Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when it is broadcast.
-
Publication number: US20210334648A1
Publication date: 2021-10-28
Application number: US16894025
Application date: 2020-06-05
Applicant: Advanced Micro Devices, Inc.
Abstract: An electronic device includes a memory that stores input matrices A and B, a cache memory, and a processor. The processor generates a compiled representation that includes values for acquiring data from input matrix A when processing instances of input data through a neural network, the values including a base address in input matrix A for each thread from among a number of threads and relative offsets, the relative offsets being distances between elements of input matrix A to be processed by the threads. The processor then stores, in the cache memory, the compiled representation including the base address for each thread and the relative offsets.
-
Publication number: US20200302285A1
Publication date: 2020-09-24
Application number: US16367093
Application date: 2019-03-27
Applicant: Advanced Micro Devices, Inc.
Abstract: Systems, apparatuses, and methods for implementing an auto-generation and tuning tool for convolution kernels are disclosed. A processor executes multiple tuning runs of a given layer of a neural network while using a different set of operating parameter values for each tuning run. The operating parameters can include one or more of input dataset fetch group size, output channel group size, and other parameters. The processor captures performance data for each tuning run, and after all tuning runs have finished, the processor determines which set of operating parameter values resulted in the best performance for the given neural network layer. The processor uses these operating parameter values for subsequent iterations of the given layer. The processor also performs the same techniques for the other layers to determine which set of operating parameter values to use for each layer so as to maximize performance of the neural network.
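The tuning procedure in this abstract amounts to a timed search over parameter combinations. Below is a minimal sketch, assuming a run_layer callable that executes the layer once with given parameter values; the timing harness and interface are assumptions for the example, though the two parameter names follow the abstract.

```python
import itertools
import time

def tune_layer(run_layer, fetch_sizes, out_channel_sizes):
    """Time one tuning run per parameter combination and keep the best.

    run_layer(fetch_group_size, out_channel_group_size) must execute the
    layer once with the given operating parameter values.
    """
    best_params, best_time = None, float("inf")
    for fg, ocg in itertools.product(fetch_sizes, out_channel_sizes):
        start = time.perf_counter()
        run_layer(fetch_group_size=fg, out_channel_group_size=ocg)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:  # keep the fastest configuration
            best_params, best_time = (fg, ocg), elapsed
    return best_params  # reused for all subsequent iterations of the layer
```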
-
Publication number: US11983624B2
Publication date: 2024-05-14
Application number: US16367093
Application date: 2019-03-27
Applicant: Advanced Micro Devices, Inc.
CPC classification number: G06N3/08, G06N20/10, G06T5/20, G06T5/50, G06T2207/20084
Abstract: Systems, apparatuses, and methods for implementing an auto-generation and tuning tool for convolution kernels are disclosed. A processor executes multiple tuning runs of a given layer of a neural network while using a different set of operating parameter values for each tuning run. The operating parameters can include one or more of input dataset fetch group size, output channel group size, and other parameters. The processor captures performance data for each tuning run, and after all tuning runs have finished, the processor determines which set of operating parameter values resulted in the best performance for the given neural network layer. The processor uses these operating parameter values for subsequent iterations of the given layer. The processor also performs the same techniques for the other layers to determine which set of operating parameter values to use for each layer so as to maximize performance of the neural network.
-
Publication number: US20210191761A1
Publication date: 2021-06-24
Application number: US16729811
Application date: 2019-12-30
Applicant: Advanced Micro Devices, Inc.
Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when it is broadcast by the processor core.
-
Publication number: US20210191758A1
Publication date: 2021-06-24
Application number: US16723016
Application date: 2019-12-20
Applicant: Advanced Micro Devices, Inc.
IPC: G06F9/48, G06F9/54, G06F12/0815, G06F12/084, G06F15/78
Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when it is broadcast.
-
Publication number: US20190004807A1
Publication date: 2019-01-03
Application number: US15657478
Application date: 2017-07-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Jiasheng Chen, Qingcheng Wang, Yunxiao Zou, Bin He, Jian Yang, Michael J. Mantor, Brian D. Emberling
IPC: G06F9/38
Abstract: Systems, apparatuses, and methods for implementing a stream processor with overlapping execution are disclosed. In one embodiment, a system includes at least a parallel processing unit with a plurality of execution pipelines. The processing throughput of the parallel processing unit is increased by overlapping execution of multi-pass instructions with single-pass instructions without increasing the instruction issue rate. A first plurality of operands of a first vector instruction are read from a shared vector register file in a single clock cycle and stored in temporary storage. The first plurality of operands are accessed and utilized to initiate multiple instructions on individual vector elements on a first execution pipeline in subsequent clock cycles. A second plurality of operands are read from the shared vector register file during the subsequent clock cycles to initiate execution of one or more second vector instructions on a second execution pipeline.
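The overlap can be pictured with a toy cycle-by-cycle model: a multi-pass instruction reads all of its register-file operands in one cycle into temporary storage, which frees the register file for a second pipeline on the following cycles. The scheduler below is an illustrative assumption, not the actual stream-processor design; passes_per_op and the op names are made up for the example.

```python
def schedule(multi_pass_ops, single_pass_ops, passes_per_op=4):
    """Return a (cycle, event) log; passes_per_op is an assumed pass count."""
    cycle, log = 0, []
    for op in multi_pass_ops:
        # One wide register-file (RF) read fills temporary storage with
        # every operand that all passes of `op` will need.
        log.append((cycle, f"RF read -> temp storage for {op}"))
        for p in range(passes_per_op):
            cycle += 1
            # Pipeline 0 works out of temp storage; no RF port needed...
            log.append((cycle, f"pipe0: {op} pass {p} (from temp storage)"))
            # ...so pipeline 1 can read the RF and issue a single-pass op
            # in the same cycle, without raising the issue rate.
            if single_pass_ops:
                log.append((cycle, f"pipe1: RF read + issue {single_pass_ops.pop(0)}"))
        cycle += 1
    return log

for entry in schedule(["v_mp0"], ["v_sp0", "v_sp1", "v_sp2"]):
    print(entry)
```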
-