EFFICIENT VECTOR-MATRIX MULTIPLY OPERATIONS ACROSS PARALLEL PROCESSING UNIT THREADS

    Publication Number: US20250021622A1

    Publication Date: 2025-01-16

    Application Number: US18769710

    Application Date: 2024-07-11

    Abstract: Disclosed are systems and techniques for efficient vector-matrix multiply operations across parallel processing unit threads. The techniques include receiving first data of a first thread, the first data comprising a first input vector and a first matrix. The techniques further include receiving second data of a second thread, the second data comprising a second input vector and a second matrix. The techniques further include combining the first input vector and the second input vector into an input matrix and generating a result matrix at least by multiplying the input matrix by the first matrix using a matrix-multiply circuit. The techniques further include separating the result matrix into a first result value and a second result value, the first result value corresponding to the first thread and the second result value corresponding to the second thread.
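
    A minimal CUDA sketch of the combine-multiply-separate pattern this abstract describes is shown below. The kernel name, the dimensions WARP and VEC_N, and the shared-memory layout are illustrative assumptions, and each thread computes its own result row in plain code where the hardware would hand the stacked input matrix to a dedicated matrix-multiply circuit.

        // Illustrative sketch (not the patented circuit): per-thread input vectors
        // are stacked into a shared input matrix, one matrix product is formed, and
        // each thread reads back the result row that corresponds to it.
        // Launch as: batched_vec_mat<<<1, WARP>>>(d_vecs, d_mat, d_out);

        constexpr int WARP  = 32;  // threads whose vectors are combined
        constexpr int VEC_N = 8;   // length of each thread's input vector (assumed)

        __global__ void batched_vec_mat(const float* vecs,  // [WARP][VEC_N], one vector per thread
                                        const float* mat,   // [VEC_N][VEC_N], matrix shared by all threads
                                        float* out)         // [WARP][VEC_N], one result row per thread
        {
            __shared__ float in_mat[WARP][VEC_N];   // input vectors stacked into a matrix
            __shared__ float res_mat[WARP][VEC_N];  // result matrix, one row per thread

            int t = threadIdx.x;                    // lane index within the warp

            // 1. Combine: each thread deposits its input vector as one row.
            for (int j = 0; j < VEC_N; ++j)
                in_mat[t][j] = vecs[t * VEC_N + j];
            __syncwarp();

            // 2. Multiply: row t of the result = row t of in_mat times mat.
            //    (Hardware would process the whole stacked matrix at once;
            //    here each thread computes its own row for clarity.)
            for (int j = 0; j < VEC_N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < VEC_N; ++k)
                    acc += in_mat[t][k] * mat[k * VEC_N + j];
                res_mat[t][j] = acc;
            }
            __syncwarp();

            // 3. Separate: each thread copies back the row corresponding to it.
            for (int j = 0; j < VEC_N; ++j)
                out[t * VEC_N + j] = res_mat[t][j];
        }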

    TECHNIQUES FOR COMPREHENSIVELY SYNCHRONIZING EXECUTION THREADS

    Publication Number: US20180314520A1

    Publication Date: 2018-11-01

    Application Number: US15499843

    Application Date: 2017-04-27

    CPC classification number: G06F9/3009 G06F9/30087 G06F9/3851 G06F9/46

    Abstract: In one embodiment, a synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.
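
    The programmer-visible counterpart of this kind of warp-level synchronization in CUDA is the __syncwarp() intrinsic. The sketch below only illustrates the pattern the abstract describes, with divergent lanes reconverging before an instruction that presupposes convergence; it is not the patented instruction itself, and the values involved are arbitrary.

        // Divergent threads in a warp are explicitly synchronized before executing
        // a warp shuffle that requires all participating lanes to be converged.
        #include <cstdio>

        __global__ void converge_then_shuffle(int* out)
        {
            int lane = threadIdx.x & 31;
            int value;

            if (lane < 16) {
                value = lane * 2;        // first half of the warp takes this path
            } else {
                value = lane + 100;      // second half takes a different path
            }

            // Ensure every lane has arrived before the shuffle below.
            __syncwarp();

            // Each lane reads the value produced by its neighboring lane.
            int neighbor = __shfl_xor_sync(0xffffffff, value, 1);
            out[threadIdx.x] = neighbor;
        }

        int main()
        {
            int* d_out;
            cudaMalloc((void**)&d_out, 32 * sizeof(int));
            converge_then_shuffle<<<1, 32>>>(d_out);

            int h_out[32];
            cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
            for (int i = 0; i < 32; ++i) printf("%d ", h_out[i]);
            printf("\n");
            cudaFree(d_out);
            return 0;
        }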

    TECHNIQUES FOR COMPREHENSIVELY SYNCHRONIZING EXECUTION THREADS

    Publication Number: US20200034143A1

    Publication Date: 2020-01-30

    Application Number: US16595398

    Application Date: 2019-10-07

    Abstract: In one embodiment, a synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.

    Techniques for comprehensively synchronizing execution threads

    Publication Number: US10977037B2

    Publication Date: 2021-04-13

    Application Number: US16595398

    Application Date: 2019-10-07

    Abstract: In one embodiment, a synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.

    Thread-level sleep in a multithreaded architecture

    Publication Number: US10817295B2

    Publication Date: 2020-10-27

    Application Number: US15582549

    Application Date: 2017-04-28

    Abstract: A streaming multiprocessor (SM) includes a nanosleep (NS) unit configured to cause individual threads executing on the SM to sleep for a programmer-specified interval of time. For a given thread, the NS unit parses a NANOSLEEP instruction and extracts a sleep time. The NS unit then maps the sleep time to a single bit of a timer and causes the thread to sleep. When the timer bit changes, the sleep time expires, and the NS unit awakens the thread. The thread may then continue executing. The SM also includes a nanotrap (NT) unit configured to issue traps using a similar timing mechanism to that described above. For a given thread, the NT unit parses a NANOTRAP instruction and extracts a trap time. The NT unit then maps the trap time to a single bit of a timer. When the timer bit changes, the NT unit issues a trap.
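
    CUDA exposes a related programmer-visible primitive, __nanosleep(), on compute capability 7.0 and later. The sketch below uses it for a simple exponential-backoff wait on a flag and is only an illustration of per-thread sleeping; it is not the NANOSLEEP/NANOTRAP timer mechanism described in the patent, and the intervals shown are arbitrary.

        // Per-thread backoff with __nanosleep() while polling a flag set by another
        // thread in the same block (requires sm_70 or newer).
        // Launch as: backoff_wait<<<1, 32>>>(d_out);

        __global__ void backoff_wait(int* out)
        {
            __shared__ volatile int flag;
            if (threadIdx.x == 0) flag = 0;
            __syncthreads();

            if (threadIdx.x == 0) {
                flag = 1;                        // "producer" thread raises the flag
            } else {
                unsigned ns = 8;                 // initial sleep interval in nanoseconds
                while (flag == 0) {
                    __nanosleep(ns);             // put only this thread to sleep
                    if (ns < 1024) ns *= 2;      // exponential backoff, capped
                }
            }
            out[threadIdx.x] = flag;             // every thread observes flag == 1
        }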

    Techniques for comprehensively synchronizing execution threads

    Publication Number: US10437593B2

    Publication Date: 2019-10-08

    Application Number: US15499843

    Application Date: 2017-04-27

    Abstract: A synchronization instruction causes a processor to ensure that specified threads included within a warp concurrently execute a single subsequent instruction. The specified threads include at least a first thread and a second thread. In operation, the first thread arrives at the synchronization instruction. The processor determines that the second thread has not yet arrived at the synchronization instruction and configures the first thread to stop executing instructions. After issuing at least one instruction for the second thread, the processor determines that all the specified threads have arrived at the synchronization instruction. The processor then causes all the specified threads to execute the subsequent instruction. Advantageously, unlike conventional approaches to synchronizing threads, the synchronization instruction enables the processor to reliably and properly execute code that includes complex control flows and/or instructions that presuppose that threads are converged.

    Implementing specialized instructions for accelerating dynamic programming algorithms

    Publication Number: US12141582B2

    Publication Date: 2024-11-12

    Application Number: US17936172

    Application Date: 2022-09-28

    Abstract: Various techniques for accelerating dynamic programming algorithms are provided. For example, a fused addition and comparison instruction, a three-operand comparison instruction, and a two-operand comparison instruction are used to accelerate a Needleman-Wunsch algorithm that determines an optimized global alignment of subsequences over two entire sequences. In another example, the fused addition and comparison instruction is used in an innermost loop of a Floyd-Warshall algorithm to reduce the number of instructions required to determine shortest paths between pairs of vertices in a graph. In another example, a two-way single instruction multiple data (SIMD) floating point variant of the three-operand comparison instruction is used to reduce the number of instructions required to determine the median of an array of floating point values.
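
    The Floyd-Warshall use case can be pictured with the CUDA sketch below: each cell relaxation is one addition followed by one comparison, which is exactly the pair a fused add-and-compare instruction collapses. The kernel uses plain min(); whether a fused intrinsic (for example __viaddmin_s32 in recent CUDA toolkits) is available depends on the GPU and toolkit version, so it appears only as an assumption in a comment.

        // One Floyd-Warshall relaxation pass for a fixed intermediate vertex k.
        // dist is an n x n distance matrix stored row-major; unreachable pairs are
        // assumed to hold a large finite sentinel (not INT_MAX) to avoid overflow.
        // Launch once per k with, e.g., a 16x16 thread block grid covering n x n.

        __global__ void fw_relax_k(int* dist, int n, int k)
        {
            int i = blockIdx.y * blockDim.y + threadIdx.y;
            int j = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n || j >= n) return;

            int d_ik = dist[i * n + k];
            int d_kj = dist[k * n + j];
            int d_ij = dist[i * n + j];

            // Innermost operation: one addition followed by one comparison.
            // A fused add-and-compare instruction (if available) collapses this
            // pair into a single instruction.
            dist[i * n + j] = min(d_ik + d_kj, d_ij);
        }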

    Implementing specialized instructions for accelerating Smith-Waterman sequence alignments

    Publication Number: US11550584B1

    Publication Date: 2023-01-10

    Application Number: US17491279

    Application Date: 2021-09-30

    Abstract: Various techniques for accelerating Smith-Waterman sequence alignments are provided. For example, threads in a group of threads are employed to use an interleaved cell layout to store relevant data in registers while computing sub-alignment data for one or more local alignment problems. In another example, specialized instructions that reduce the number of cycles required to compute each sub-alignment score are utilized. In another example, threads are employed to compute sub-alignment data for a subset of columns of one or more local alignment problems while other threads begin computing sub-alignment data based on partial result data received from the preceding threads. After computing a maximum sub-alignment score, a thread stores the maximum sub-alignment score and the corresponding position in global memory.
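
    The per-cell work these instructions target can be pictured with the CUDA sketch below, which writes the Smith-Waterman recurrence H[i][j] = max(0, H[i-1][j-1] + s(a,b), E[i][j], F[i][j]) with ordinary max() calls; specialized three-operand and max-with-zero comparisons reduce the instruction count of this chain. The function names, scoring values, and the demo kernel are illustrative assumptions, not details from the patent.

        // One Smith-Waterman cell update: an addition followed by a chain of
        // comparisons, floored at zero for local alignment.
        __device__ __forceinline__ int sw_cell(int h_diag,       // H[i-1][j-1]
                                               int e_up,         // E[i][j] (gap in query)
                                               int f_left,       // F[i][j] (gap in reference)
                                               int substitution) // s(a_i, b_j)
        {
            int score = h_diag + substitution;
            score = max(score, e_up);
            score = max(score, f_left);
            return max(score, 0);            // local alignment floors at zero
        }

        // Tiny demo: a matching cell with match score +2 and no better gap scores.
        // Launch as: sw_demo<<<1, 1>>>(d_out);
        __global__ void sw_demo(int* out)
        {
            out[0] = sw_cell(/*h_diag=*/5, /*e_up=*/3, /*f_left=*/4, /*substitution=*/2);
        }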
