-
Publication No.: US10936697B2
Publication Date: 2021-03-02
Application No.: US16044145
Filing Date: 2018-07-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Khaled Hamidouche, Michael W. LeBeane, Nicholas P. Malaya, Joseph L. Greathouse
Abstract: A method includes storing a first portion of a sparse triangular matrix in a local memory and launching a kernel for executing a set of workgroups. The first portion includes a plurality of row blocks, and each workgroup in the set of workgroups is associated with one of the plurality of row blocks. The method also includes, for each workgroup in the set of workgroups, solving the row block. The row block is solved by, for each row segment of a first subset of row segments in the row block, calculating a partial sum for the row segment based on one or more matrix elements in the row segment, and writing the partial sum to a remote memory of a first remote processing unit prior to terminating the kernel.
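The partial-sum idea in this abstract can be illustrated with a minimal single-process sketch. It is not the patented implementation: the function name `solve_split`, the `(column, value)` row layout, the column split point, and the `remote_partial` list standing in for a remote processing unit's memory are all assumptions made for illustration.

```python
def solve_split(rows, b, split):
    """Solve L x = b for a sparse lower-triangular L.

    rows[i] is a list of (column, value) pairs for row i, diagonal included.
    Columns [0, split) are treated as the portion held by "unit 0";
    columns [split, n) belong to "unit 1". Both units are simulated here.
    """
    n = len(b)
    x = [0.0] * n
    remote_partial = [0.0] * n  # hypothetical stand-in for unit 1's memory

    # Unit 0: solve its own rows with forward substitution.
    for i in range(split):
        s = sum(v * x[c] for c, v in rows[i] if c < i)
        diag = next(v for c, v in rows[i] if c == i)
        x[i] = (b[i] - s) / diag

    # Unit 0: for the remaining rows, compute the partial sum over the row
    # segment it holds and write it to the "remote" buffer.
    for i in range(split, n):
        remote_partial[i] = sum(v * x[c] for c, v in rows[i] if c < split)

    # Unit 1: finish each row using the received partial sum.
    for i in range(split, n):
        s = remote_partial[i] + sum(
            v * x[c] for c, v in rows[i] if split <= c < i
        )
        diag = next(v for c, v in rows[i] if c == i)
        x[i] = (b[i] - s) / diag
    return x


# L = [[2, 0, 0], [1, 3, 0], [0, 4, 5]], b = [2, 7, 23]
rows = [[(0, 2.0)], [(0, 1.0), (1, 3.0)], [(1, 4.0), (2, 5.0)]]
x = solve_split(rows, [2.0, 7.0, 23.0], split=1)
```

In a real multi-GPU setting the two loops over `range(split, n)` would run on different devices, and the write to `remote_partial` would be a remote memory write issued from inside the running kernel.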
-
Publication No.: US10740163B2
Publication Date: 2020-08-11
Application No.: US16022498
Filing Date: 2018-06-28
Applicant: Advanced Micro Devices, Inc.
Inventor: Khaled Hamidouche, Michael Wayne LeBeane, Walter B. Benton
Abstract: Systems, apparatuses, and methods for performing network packet templating for graphics processing unit (GPU)-initiated communication are disclosed. A central processing unit (CPU) creates a network packet according to a template and populates a first subset of fields of the network packet with static data. Next, the CPU stores the network packet in a memory. A GPU initiates execution of a kernel and detects a network communication request within the kernel and prior to the kernel completing execution. Responsive to this determination, the GPU populates a second subset of fields of the network packet with runtime data. Then, the GPU generates a notification that the network packet is ready to be processed. A network interface controller (NIC) processes the network packet using data retrieved from the first subset of fields and from the second subset of fields responsive to detecting the notification.
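The templating flow described above (static fields filled ahead of time, runtime fields filled during kernel execution, then a readiness notification) can be sketched as follows. This is a simplified model with hypothetical names: the `Packet` fields and the functions `cpu_create_template`, `gpu_fill_runtime`, and `nic_process` are assumptions and do not come from the patent or any real driver API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    # First subset: static fields, populated once by the CPU (assumed names).
    dest_node: int
    opcode: str
    # Second subset: runtime fields, populated later by the GPU (assumed names).
    address: Optional[int] = None
    length: Optional[int] = None
    # Notification flag that the NIC checks.
    ready: bool = False


def cpu_create_template(dest_node: int, opcode: str) -> Packet:
    """CPU step: build the packet from a template with static data only."""
    return Packet(dest_node=dest_node, opcode=opcode)


def gpu_fill_runtime(pkt: Packet, address: int, length: int) -> None:
    """GPU step: fill runtime fields mid-kernel, then signal readiness."""
    pkt.address = address
    pkt.length = length
    pkt.ready = True


def nic_process(pkt: Packet):
    """NIC step: process only after the notification, using both subsets."""
    if not pkt.ready:
        return None
    return (pkt.dest_node, pkt.opcode, pkt.address, pkt.length)


pkt = cpu_create_template(dest_node=3, opcode="put")
before = nic_process(pkt)          # not ready yet
gpu_fill_runtime(pkt, address=4096, length=64)
after = nic_process(pkt)
```

The point of the split is that the GPU only touches the few fields that are unknown until runtime, which keeps packet construction off the kernel's critical path.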
-
Publication No.: US20200034405A1
Publication Date: 2020-01-30
Application No.: US16044145
Filing Date: 2018-07-24
Applicant: Advanced Micro Devices, Inc.
Inventor: Khaled Hamidouche, Michael W. LeBeane, Nicholas P. Malaya, Joseph L. Greathouse
Abstract: A method includes storing a first portion of a sparse triangular matrix in a local memory and launching a kernel for executing a set of workgroups. The first portion includes a plurality of row blocks, and each workgroup in the set of workgroups is associated with one of the plurality of row blocks. The method also includes, for each workgroup in the set of workgroups, solving the row block. The row block is solved by, for each row segment of a first subset of row segments in the row block, calculating a partial sum for the row segment based on one or more matrix elements in the row segment, and writing the partial sum to a remote memory of a first remote processing unit prior to terminating the kernel.
-
Publication No.: US20200004610A1
Publication Date: 2020-01-02
Application No.: US16022498
Filing Date: 2018-06-28
Applicant: Advanced Micro Devices, Inc.
Inventor: Khaled Hamidouche, Michael Wayne LeBeane, Walter B. Benton
Abstract: Systems, apparatuses, and methods for performing network packet templating for graphics processing unit (GPU)-initiated communication are disclosed. A central processing unit (CPU) creates a network packet according to a template and populates a first subset of fields of the network packet with static data. Next, the CPU stores the network packet in a memory. A GPU initiates execution of a kernel and detects a network communication request within the kernel and prior to the kernel completing execution. Responsive to this determination, the GPU populates a second subset of fields of the network packet with runtime data. Then, the GPU generates a notification that the network packet is ready to be processed. A network interface controller (NIC) processes the network packet using data retrieved from the first subset of fields and from the second subset of fields responsive to detecting the notification.
-
Publication No.: US20240311182A1
Publication Date: 2024-09-19
Application No.: US18185641
Filing Date: 2023-03-17
Applicant: Advanced Micro Devices, Inc.
Inventor: Kishore Punniyamurthy, Sagnik Basu, Khaled Hamidouche, Brandon Keith Potter
IPC: G06F9/48
CPC classification number: G06F9/4881
Abstract: A device includes a communication scheduler to generate schedule trees for scheduling data communication among a plurality of nodes configured to perform a collective operation using data contributed from the plurality of nodes. The device includes data reduction logic to: identify one or more skewed nodes among the plurality of nodes, perform, according to a first set of schedule trees, a first operation to generate partial results based on data contributed from non-skewed nodes, and perform, according to a second set of schedule trees, a second operation to generate final results based on the partial results and data contributed from the one or more skewed nodes.
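The two-stage reduction in this abstract can be shown with a toy sum. The sketch below assumes the collective is a sum and that the set of skewed (late) nodes is already known; the schedule trees themselves are not modeled, and `two_phase_sum` is a hypothetical name.

```python
def two_phase_sum(contributions, skewed):
    """Reduce in two stages.

    contributions: dict mapping node id -> contributed value.
    skewed: set of node ids identified as skewed (assumed given).
    Returns (partial, final).
    """
    # First operation: combine data from the non-skewed nodes only.
    partial = sum(v for node, v in contributions.items() if node not in skewed)
    # Second operation: fold in the skewed nodes' data once it is available.
    final = partial + sum(contributions[node] for node in skewed)
    return partial, final


contributions = {0: 1, 1: 2, 2: 3, 3: 4}
partial, final = two_phase_sum(contributions, skewed={2})
```

Because the first stage does not wait for node 2, the on-time nodes can finish most of the reduction while the skewed node is still producing its data.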
-
Publication No.: US20240211399A1
Publication Date: 2024-06-27
Application No.: US18089480
Filing Date: 2022-12-27
Applicant: Advanced Micro Devices, Inc.
Inventor: Kishore Punniyamurthy, Khaled Hamidouche, Brandon Keith Potter
IPC: G06F12/0813, G06N20/00
CPC classification number: G06F12/0813, G06N20/00
Abstract: A distributed cache network used for machine learning is provided which comprises a network fabric having file systems which store data and a plurality of processing devices, each comprising cache memory and a processor configured to execute a training of a machine learning model and selectively cache portions of the data based on a frequency with which the data is accessed by the processor. Each processing device stores metadata identifying portions of data which are cached in the cache memory and other portions of the data which are cached in other processing devices of the network. When requested data is not cached in another processing device, the portion of requested data is accessed from a network file system via a client to server channel and is accessed from another processing device via a client to client channel when the requested data is cached in the other processing device.
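The lookup path described above, where metadata decides between a peer cache and the network file system, might look roughly like the sketch below. Everything here is an assumption for illustration: the dictionaries stand in for cache memory, peer metadata, and the file system, the local-cache check is added for completeness, and frequency-based cache admission is omitted.

```python
def fetch(key, local_cache, peer_directory, peers, file_system):
    """Return (data, channel) for a requested key.

    local_cache: dict for this device's cache.
    peer_directory: metadata dict mapping key -> id of the peer caching it.
    peers: dict mapping peer id -> that peer's cache dict.
    file_system: dict standing in for the network file system.
    """
    if key in local_cache:
        return local_cache[key], "local"
    owner = peer_directory.get(key)
    if owner is not None:
        # Cached on another processing device: use the client-to-client channel.
        return peers[owner][key], "client-to-client"
    # Not cached on any peer: fall back to the client-to-server channel.
    return file_system[key], "client-to-server"


file_system = {"a": 1, "b": 2, "c": 3}
peers = {"gpu1": {"b": 2}}
peer_directory = {"b": "gpu1"}
local_cache = {"a": 1}

result_a = fetch("a", local_cache, peer_directory, peers, file_system)
result_b = fetch("b", local_cache, peer_directory, peers, file_system)
result_c = fetch("c", local_cache, peer_directory, peers, file_system)
```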
-
Publication No.: US11922207B2
Publication Date: 2024-03-05
Application No.: US16993150
Filing Date: 2020-08-13
Applicant: Advanced Micro Devices, Inc.
Inventor: Michael W. LeBeane, Khaled Hamidouche, Brandon K. Potter
CPC classification number: G06F9/48, G06F9/3836, G06F9/3887, G06F9/54, H04L67/10, G06T1/20
Abstract: An approach is provided for coalescing network commands in a GPU that implements a SIMT architecture. Compatible next network operations from different threads are coalesced into a single network command packet. This reduces the number of network command packets generated and issued by threads, thereby increasing efficiency, and improving throughput. The approach is applicable to any number of threads and any thread organization methodology, such as wavefronts, warps, etc.
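As a rough illustration of coalescing, the sketch below merges per-thread network operations into one command packet per group. The compatibility rule used here (same destination and same opcode) is an assumption chosen for the example, as are the tuple layout and the name `coalesce`; the abstract does not specify them.

```python
from collections import defaultdict


def coalesce(ops):
    """Merge compatible per-thread operations into command packets.

    ops: list of (thread_id, dest, opcode, payload) tuples.
    Two operations are treated as compatible when they share dest and opcode
    (an assumed rule for this sketch).
    """
    groups = defaultdict(list)
    for thread_id, dest, opcode, payload in ops:
        groups[(dest, opcode)].append((thread_id, payload))
    # One packet per compatible group instead of one per thread.
    return [
        {"dest": dest, "opcode": opcode, "items": items}
        for (dest, opcode), items in groups.items()
    ]


ops = [
    (0, 1, "put", "a"),
    (1, 1, "put", "b"),
    (2, 2, "put", "c"),
]
packets = coalesce(ops)
```

Here three per-thread operations become two packets, which is the reduction in issued command packets that the abstract credits for the throughput gain.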
-
Publication No.: US20240005126A1
Publication Date: 2024-01-04
Application No.: US17853670
Filing Date: 2022-06-29
Applicant: Advanced Micro Devices, Inc.
Inventor: Kishore Punniyamurthy, Khaled Hamidouche, Brandon K. Potter, Rohit Shahaji Zambre
Abstract: An electronic device includes one or more data producing nodes and a data consuming node. Each data producing node separately generates two or more portions of a respective block of data. Upon completing generating each portion of the two or more portions of the respective block of data, each data producing node communicates that portion of the respective block of data to the data consuming node. Upon receiving corresponding portions of the respective blocks of data from each of the one or more data producing nodes, the data consuming node performs operations for a model using the corresponding portions of the respective blocks of data.
-
Publication No.: US20230120934A1
Publication Date: 2023-04-20
Application No.: US18068836
Filing Date: 2022-12-20
Applicant: Advanced Micro Devices, Inc.
Inventor: Michael Wayne LeBeane, Khaled Hamidouche, Walter B. Benton
Abstract: Systems, apparatuses, and methods for generating network messages on a parallel processor are disclosed. A system includes at least a parallel processor, a general purpose processor, and a network interface unit. The parallel processor includes at least a plurality of compute units, a command processor, and a cache. A thread within a kernel executing on a compute unit of the parallel processor generates a network message and stores the network message and a corresponding indication in the cache. In response to detecting the indication of the network message in the cache, the command processor processes and conveys the network message to the network interface unit without involving the general purpose processor.
-
Publication No.: US11544121B2
Publication Date: 2023-01-03
Application No.: US15815043
Filing Date: 2017-11-16
Applicant: Advanced Micro Devices, Inc.
Inventor: Michael Wayne LeBeane, Khaled Hamidouche, Walter B. Benton
Abstract: Systems, apparatuses, and methods for generating network messages on a parallel processor are disclosed. A system includes at least a parallel processor, a general purpose processor, and a network interface unit. The parallel processor includes at least a plurality of compute units, a command processor, and a cache. A thread within a kernel executing on a compute unit of the parallel processor generates a network message and stores the network message and a corresponding indication in the cache. In response to detecting the indication of the network message in the cache, the command processor processes and conveys the network message to the network interface unit without involving the general purpose processor.