Patent search ap:("Advanced Micro Devices, Inc.") AND inv:"Jian Yang"

1.

Granted invention patent
Statically generated compiled representations for processing data in neural networks (in force)

Publication (announcement) No.: US11615306B2

Publication (announcement) date: 2023-03-28

Application No.: US16894025

Application date: 2020-06-05

Applicant: Advanced Micro Devices, Inc.

Inventor: Xiuyu Li, Jian Yang

IPC: G06N3/08, G06N3/04

Abstract: An electronic device includes a memory that stores input matrices A and B, a cache memory, and a processor. The processor generates a compiled representation that includes values for acquiring data from input matrix A when processing instances of input data through the neural network, the values including a base address in input matrix A for each thread from among a number of threads and relative offsets, the relative offsets being distances between elements of input matrix A to be processed by the threads. The processor then stores, in the local cache memory, the compiled representation including the base address for each thread and the relative offsets.
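
Illustrative sketch (not taken from the patent): the abstract describes a compiled representation holding one base address per thread plus a shared list of relative offsets. The Python below shows one way such a structure could be precomputed and reused. The row-major layout, the sliding-window access pattern, and every name (`CompiledRepresentation`, `compile_representation`, `gather`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledRepresentation:
    # One flat index into input matrix A per thread (hypothetical layout).
    base_addresses: tuple
    # Distances from a thread's base to each element it processes;
    # the same offsets are shared by all threads.
    relative_offsets: tuple


def compile_representation(height, width, kernel):
    """Precompute bases and offsets once for a row-major height x width
    matrix scanned by a kernel x kernel window (assumed access pattern)."""
    out_h = height - kernel + 1
    out_w = width - kernel + 1
    bases = tuple(r * width + c for r in range(out_h) for c in range(out_w))
    offsets = tuple(kr * width + kc
                    for kr in range(kernel) for kc in range(kernel))
    return CompiledRepresentation(bases, offsets)


def gather(flat_a, rep, thread_id):
    """Fetch the elements one thread processes using base + offsets."""
    base = rep.base_addresses[thread_id]
    return [flat_a[base + off] for off in rep.relative_offsets]


if __name__ == "__main__":
    rep = compile_representation(height=3, width=4, kernel=2)
    a = list(range(12))  # stand-in for a flattened 3 x 4 input matrix A
    print(gather(a, rep, thread_id=0))
```

Because the offsets depend only on the matrix shape and window size, the same representation can be reused for every input instance of that shape, which is the apparent motivation for generating it statically.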

2.

Invention patent application
METHOD FOR MATRIX DATA BROADCAST IN PARALLEL PROCESSING (in force)

Publication (announcement) No.: US20220129312A1

Publication (announcement) date: 2022-04-28

Application No.: US17571374

Application date: 2022-01-07

Applicant: Advanced Micro Devices, Inc.

Inventor: Li Peng, Jian Yang, Chi Tang

IPC: G06F9/48, G06F9/54, G06F15/78, G06F12/084, G06F12/0815

Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read access to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted.
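
Illustrative sketch (not taken from the patent): the abstract describes replacing several reads of the same shared data with a single access request that identifies every compute unit due to receive the broadcast. A minimal software model of that idea follows. Encoding the receivers as a bitmask and the name `build_broadcast_requests` are assumptions for illustration, not the claimed hardware mechanism.

```python
from collections import defaultdict


def build_broadcast_requests(reads_by_cu):
    """Merge per-compute-unit reads into one request per address.

    reads_by_cu maps a compute unit id to the addresses it needs.
    Returns (address, target_mask) pairs sorted by address, where bit i
    of target_mask is set if compute unit i should receive the data.
    """
    targets = defaultdict(int)
    for cu_id, addresses in reads_by_cu.items():
        for addr in set(addresses):
            targets[addr] |= 1 << cu_id
    return sorted(targets.items())


if __name__ == "__main__":
    reads = {0: [0x100, 0x200], 1: [0x100, 0x300], 2: [0x100]}
    for addr, mask in build_broadcast_requests(reads):
        print(hex(addr), bin(mask))
```

In this example five individual reads collapse into three requests, and the request for address 0x100 carries a mask naming all three compute units.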

3.

Granted invention patent
Matrix data broadcast architecture (in force)

Publication (announcement) No.: US11609785B2

Publication (announcement) date: 2023-03-21

Application No.: US16729811

Application date: 2019-12-30

Applicant: Advanced Micro Devices, Inc.

Inventor: Li Peng, Jian Yang, Chi Tang

IPC: G06F9/30, G06F9/38, G06F9/48, G06F9/46, G06F9/32, G06F9/54

Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted by the processor core.

4.

Granted invention patent
Method for matrix data broadcast in parallel processing (in force)

Publication (announcement) No.: US11275612B2

Publication (announcement) date: 2022-03-15

Application No.: US16723016

Application date: 2019-12-20

Applicant: Advanced Micro Devices, Inc.

Inventor: Li Peng, Jian Yang, Chi Tang

IPC: G06F9/48, G06F9/54, G06F15/78, G06F12/084, G06F12/0815

Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read access to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted.

5.

Invention patent application
Statically Generated Compiled Representations for Processing Data in Neural Networks (in force)

Publication (announcement) No.: US20210334648A1

Publication (announcement) date: 2021-10-28

Application No.: US16894025

Application date: 2020-06-05

Applicant: Advanced Micro Devices, Inc.

Inventor: Xiuyu Li, Jian Yang

IPC: G06N3/08, G06N3/04

Abstract: An electronic device includes a memory that stores input matrices A and B, a cache memory, and a processor. The processor generates a compiled representation that includes values for acquiring data from input matrix A when processing instances of input data through the neural network, the values including a base address in input matrix A for each thread from among a number of threads and relative offsets, the relative offsets being distances between elements of input matrix A to be processed by the threads. The processor then stores, in the local cache memory, the compiled representation including the base address for each thread and the relative offsets.

6.

Invention patent application
AUTO GENERATION AND TUNING TOOL FOR CONVOLUTION KERNELS (under examination, published)

Publication (announcement) No.: US20200302285A1

Publication (announcement) date: 2020-09-24

Application No.: US16367093

Application date: 2019-03-27

Applicant: Advanced Micro Devices, Inc.

Inventor: Fei Wang, Jian Yang

IPC: G06N3/08, G06N20/10, G06T5/20, G06T5/50

Abstract: Systems, apparatuses, and methods for implementing an auto generation and tuning tool for convolution kernels are disclosed. A processor executes multiple tuning runs of a given layer of a neural network while using a different set of operating parameter values for each tuning run. The operating parameters can include one or more of input dataset fetch group size, output channel group size, and other parameters. The processor captures performance data for each tuning run and then after all tuning runs have finished, the processor determines which set of operating parameter values resulted in a better performance for the given neural network layer. The processor uses these operating parameter values for subsequent iterations of the given layer. The processor also performs the same techniques for other layers to determine which set of operating parameter values to use for each layer so as to maximize performance of the neural network.
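
Illustrative sketch (not taken from the patent): the abstract describes running a layer once per candidate set of operating parameter values, recording performance for each run, and picking the best set only after all runs have finished. The Python below models that loop. The exhaustive grid search, the wall-clock timing, and the names `tune_layer`, `run_layer`, and `measure` are assumptions for illustration.

```python
import itertools
import time


def tune_layer(run_layer, param_grid, measure=None):
    """Try every combination in param_grid and return (best_params, results).

    run_layer is a callable taking the parameters as keyword arguments.
    measure(params) returns a cost where lower is better; by default it
    is the wall-clock time of one call to run_layer.
    """
    if measure is None:
        def measure(params):
            start = time.perf_counter()
            run_layer(**params)
            return time.perf_counter() - start

    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        results.append((measure(params), params))  # one tuning run

    # Select the winner only after all tuning runs have finished.
    _, best_params = min(results, key=lambda r: r[0])
    return best_params, results


if __name__ == "__main__":
    grid = {"fetch_group_size": [1, 2, 4],
            "output_channel_group_size": [8, 16]}
    # Stand-in layer: a real tuner would launch the convolution kernel here.
    best, runs = tune_layer(lambda **params: sum(range(10_000)), grid)
    print(best, len(runs))
```

Calling `tune_layer` once per layer and caching each result would mirror the per-layer selection the abstract describes. A single timed run per configuration is noisy, so a practical tuner would likely repeat each run and aggregate the measurements.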

7.

Granted invention patent
Auto generation and tuning tool for convolution kernels (in force)

Publication (announcement) No.: US11983624B2

Publication (announcement) date: 2024-05-14

Application No.: US16367093

Application date: 2019-03-27

Applicant: Advanced Micro Devices, Inc.

Inventor: Fei Wang, Jian Yang

IPC: G06T5/50, G06N3/08, G06N20/10, G06T5/20

CPC classification number: G06N3/08, G06N20/10, G06T5/20, G06T5/50, G06T2207/20084

Abstract: Systems, apparatuses, and methods for implementing an auto generation and tuning tool for convolution kernels are disclosed. A processor executes multiple tuning runs of a given layer of a neural network while using a different set of operating parameter values for each tuning run. The operating parameters can include one or more of input dataset fetch group size, output channel group size, and other parameters. The processor captures performance data for each tuning run and then after all tuning runs have finished, the processor determines which set of operating parameter values resulted in a better performance for the given neural network layer. The processor uses these operating parameter values for subsequent iterations of the given layer. The processor also performs the same techniques for other layers to determine which set of operating parameter values to use for each layer so as to maximize performance of the neural network.

8.

Invention patent application
MATRIX DATA BROADCAST ARCHITECTURE (in force)

Publication (announcement) No.: US20210191761A1

Publication (announcement) date: 2021-06-24

Application No.: US16729811

Application date: 2019-12-30

Applicant: Advanced Micro Devices, Inc.

Inventor: Li Peng, Jian Yang, Chi Tang

IPC: G06F9/48, G06F9/46, G06F9/54, G06F9/30, G06F9/32

Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted by the processor core.

9.

Granted invention patent
Method for matrix data broadcast in parallel processing (in force)

Publication (announcement) No.: US11983560B2

Publication (announcement) date: 2024-05-14

Application No.: US17571374

Application date: 2022-01-07

Applicant: Advanced Micro Devices, Inc.

Inventor: Li Peng, Jian Yang, Chi Tang

IPC: G06F9/48, G06F9/54, G06F12/0815, G06F12/084, G06F15/78

CPC classification number: G06F9/4881, G06F9/542, G06F12/0815, G06F12/084, G06F15/7807

Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than send multiple read access to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcasted.

10.

Invention patent application
Accelerated Compute Tessellation by Compact Topological Data Structure (in force)

Publication (announcement) No.: US20130169636A1

Publication (announcement) date: 2013-07-04

Application No.: US13688853

Application date: 2012-11-29

Applicant: Advanced Micro Devices, Inc.

Inventor: Jian Yang, Huaibing Zhu, Vineet Goel, Yan Li

IPC: G06T15/08

CPC classification number: G06T15/08, G06T15/04, G06T17/20, G06T17/205

Abstract: A system, method, and computer program product are provided for tessellation using shaders. New graphics pipeline stages implemented by shaders are introduced, including an inner ring shader, an outer edge shader, and topologic shader, which work together with a domain shader and geometry shader to provide tessellated points and primitives. A hull shader is modified to compute values used by the new shaders to perform tessellation algorithms. This approach provides parallelism and customizability to the presently static tessellation engine implementation.