-
Publication No.: US11726757B2
Publication Date: 2023-08-15
Application No.: US16811068
Filing Date: 2020-03-06
Applicant: Nvidia Corporation
Inventor: William James Dally
CPC classification number: G06F8/451 , G06F9/5066 , G06T1/20
Abstract: The disclosure provides processors that are configured to perform dynamic programming according to an instruction, a method for configuring a processor for dynamic programming according to an instruction, and a method of computing a modified Smith-Waterman algorithm employing an instruction for configuring a parallel processing unit. In one example, the method for configuring includes: (1) receiving, by execution cores of the processor, an instruction that directs the execution cores to compute a set of recurrence equations employing a matrix, (2) configuring the execution cores, according to the set of recurrence equations, to compute states for elements of the matrix, and (3) storing the computed states for current elements of the matrix in registers of the execution cores, wherein the computed states are determined based on the set of recurrence equations and input data.
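The recurrence equations referenced above follow the affine-gap Smith-Waterman pattern. Below is a minimal Python sketch of such recurrences evaluated element by element; the scoring parameters and function name are illustrative assumptions, not the patented instruction itself.

```python
# Minimal sketch of affine-gap Smith-Waterman recurrence equations of the kind a
# dynamic-programming instruction would evaluate per matrix element. Scoring
# parameters and names are illustrative assumptions, not taken from the patent.
def smith_waterman(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    rows, cols = len(a) + 1, len(b) + 1
    # H: best local-alignment score; E/F: gap-extension states per element.
    H = [[0] * cols for _ in range(rows)]
    E = [[0] * cols for _ in range(rows)]
    F = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            E[i][j] = max(E[i][j - 1] + gap_extend, H[i][j - 1] + gap_open)
            F[i][j] = max(F[i - 1][j] + gap_extend, H[i - 1][j] + gap_open)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCU"))  # best local alignment score
```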
-
Publication No.: US20230237308A1
Publication Date: 2023-07-27
Application No.: US17814957
Filing Date: 2022-07-26
Applicant: NVIDIA Corporation
Inventor: Charbel Sakr , Steve Haihang Dai , Brucek Kurdo Khailany , William James Dally , Rangharajan Venkatesan , Brian Matthew Zimmer
Abstract: Quantizing the tensors and vectors processed within a neural network reduces power consumption and may accelerate processing. Quantization reduces the number of bits used to represent a value, and decreasing the number of bits can decrease the accuracy of computations that use the value. Ideally, quantization is performed without reducing accuracy. Quantization-aware training (QAT) is performed by dynamically quantizing tensors (weights and activations) using optimal clipping scalars, "optimal" in the sense that the mean squared error (MSE) of the quantized operation is minimized; the clipping scalars define the degree or amount of quantization for the various tensors of the operation. Conventional techniques that quantize tensors during training suffer from high amounts of noise (error). Other techniques compute the clipping scalars offline through a brute-force search to provide high accuracy. In contrast, the optimal clipping scalars can be computed online and provide the same accuracy as clipping scalars computed offline.
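As a rough illustration of MSE-minimizing clipping, the Python sketch below searches candidate clipping scalars for one tensor and keeps the one with the lowest quantization MSE. The grid search, bit width, and names are assumptions standing in for the online computation the abstract describes.

```python
# Illustrative sketch: choose a clipping scalar that minimizes the mean squared
# error (MSE) of uniform quantization for one tensor. The simple grid search and
# 4-bit setting are assumptions, not the patented online method.
def quantize(x, clip, bits=4):
    # Symmetric uniform quantization of a list of floats onto 2**bits levels.
    step = 2 * clip / (2 ** bits - 1)
    return [max(-clip, min(clip, round(v / step) * step)) for v in x]

def optimal_clip(x, bits=4, candidates=64):
    max_abs = max(abs(v) for v in x)
    best_clip, best_mse = max_abs, float("inf")
    for k in range(1, candidates + 1):
        clip = max_abs * k / candidates
        q = quantize(x, clip, bits)
        mse = sum((v - w) ** 2 for v, w in zip(x, q)) / len(x)
        if mse < best_mse:
            best_clip, best_mse = clip, mse
    return best_clip

weights = [0.03, -1.2, 0.4, 2.5, -0.7, 0.05, 1.1, -0.2]
print(optimal_clip(weights))  # clipping scalar used to quantize this tensor
```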
-
Publication No.: US20230237011A1
Publication Date: 2023-07-27
Application No.: US17581734
Filing Date: 2022-01-21
Applicant: NVIDIA Corporation
Inventor: William James Dally
CPC classification number: G06F15/80 , G06F12/0646 , G06F2212/7201
Abstract: A mapping may be made between an array of physical processors and an array of functional logical processors. Also, a mapping may be made between logical memory channels (associated with the logical processors) and functional physical memory channels (associated with the physical processors). These mappings may be stored within one or more tables, which may then be used to bypass faulty processors and memory channels when implementing memory accesses, while optimizing locality (e.g., by minimizing the distance between memory channels and processors).
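A minimal sketch of such mapping tables, assuming a one-dimensional array of eight physical units and a nearest-surviving-index assignment as a stand-in for the locality optimization (all names and indices are illustrative):

```python
# Illustrative sketch: tables mapping logical processors and logical memory
# channels onto the surviving (functional) physical units so faulty ones are
# bypassed. The nearest-index assignment is an assumed stand-in for locality.
faulty_procs = {2}       # physical processor indices known to be faulty
faulty_chans = {5}       # physical memory channel indices known to be faulty
num_physical = 8

good_procs = [p for p in range(num_physical) if p not in faulty_procs]
good_chans = [c for c in range(num_physical) if c not in faulty_chans]

# Logical index i maps to the i-th surviving physical unit, keeping neighbors close.
proc_table = dict(enumerate(good_procs))
chan_table = dict(enumerate(good_chans))

def access(logical_proc, logical_chan):
    # A memory access from a logical processor is steered through both tables.
    return proc_table[logical_proc], chan_table[logical_chan]

print(access(3, 6))  # -> (4, 7): both lookups skip past the faulty units
```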
-
Publication No.: US20170154667A1
Publication Date: 2017-06-01
Application No.: US15430393
Filing Date: 2017-02-10
Applicant: NVIDIA Corporation
Inventor: William James Dally
IPC: G11C11/4091 , G11C11/4099 , G11C11/408
CPC classification number: G11C11/4091 , G11C11/4063 , G11C11/4085 , G11C11/4087 , G11C11/4099
Abstract: This description is directed to a dynamic random access memory (DRAM) array having a plurality of rows and a plurality of columns. The array further includes a plurality of cells, each of which is associated with one of the columns and one of the rows. Each cell includes a capacitor that is selectively coupled to a bit line of its associated column so as to share charge with the bit line when the cell is selected. A segmented word line circuit for each row is controllable to cause selection of only a portion of the cells in that row.
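A behavioral sketch only, assuming 256-column segments: the segmented word line enables just one segment of the selected row, so only that segment's cells share charge with their bit lines.

```python
# Behavioral sketch: a segmented word line activates only one segment of a row.
# The segment width and column count are assumptions, not circuit detail.
COLS_PER_SEGMENT = 256

def selected_cells(row, segment, total_cols=1024):
    # Returns the row and the column range enabled by the segmented word line.
    start = segment * COLS_PER_SEGMENT
    return row, range(start, min(start + COLS_PER_SEGMENT, total_cols))

row, cols = selected_cells(row=7, segment=2)
print(row, cols.start, cols.stop)  # only columns 512..767 of row 7 are selected
```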
-
Publication No.: US09460776B2
Publication Date: 2016-10-04
Application No.: US13748499
Filing Date: 2013-01-23
Applicant: NVIDIA Corporation
Inventor: William James Dally
IPC: G11C11/24 , G11C11/412 , G11C11/419 , G11C11/404
CPC classification number: G11C11/4125 , G11C11/404 , G11C11/419
Abstract: The disclosure provides an SRAM array having a plurality of wordlines and a plurality of bitlines, referred to generally as SRAM lines. The array has a plurality of cells, each cell being defined by an intersection between one of the wordlines and one of the bitlines. The SRAM array further includes voltage boost circuitry operatively coupled with the cells and configured to provide an amount of voltage boost that is based on the address of the cell to be accessed and/or to provide this voltage boost on an SRAM line via capacitive charge coupling.
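A sketch of an address-dependent boost, under the assumption (for illustration only) that the boost grows with the addressed cell's distance along the SRAM line; the voltage values are not from the patent.

```python
# Illustrative sketch: a boost amount derived from the address of the accessed
# cell. Scaling with distance along the line and the millivolt values are assumptions.
MAX_BOOST_MV = 100
CELLS_PER_LINE = 512

def boost_for_address(row, col):
    # Cells farther along the line receive a larger capacitively coupled boost.
    return MAX_BOOST_MV * col / (CELLS_PER_LINE - 1)

print(boost_for_address(row=3, col=10))   # small boost near the driver
print(boost_for_address(row=3, col=500))  # larger boost far from the driver
```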
-
Publication No.: US20140097813A1
Publication Date: 2014-04-10
Application No.: US13647202
Filing Date: 2012-10-08
Applicant: NVIDIA CORPORATION
Inventor: William James Dally
IPC: G05F1/625
CPC classification number: H02M3/158 , H02M1/15 , H02M3/1582 , H02M3/1588 , H02M2003/1566 , Y02B70/1425
Abstract: Embodiments are disclosed relating to an electric power conversion device and methods for controlling the operation thereof. One disclosed embodiment provides an electric power conversion device comprising a first current control mechanism coupled to an electric power source and an upstream end of an inductor, where the first current control mechanism is operable to control inductor current. The electric power conversion device further comprises a second current control mechanism coupled between the downstream end of the inductor and a load, where the second current control mechanism is operable to control how much of the inductor current is delivered to the load.
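A heavily simplified discrete-time sketch of the two control mechanisms: the first switch ramps the inductor current from the source, and the second steers a chosen fraction of that current to the load. Component values, time step, and control policy are assumptions, not the disclosed controller.

```python
# Behavioral sketch only: first mechanism charges the inductor from the source,
# second mechanism delivers a fraction of the inductor current to the load.
# All values and the control policy are illustrative assumptions.
V_SOURCE = 12.0   # volts
L = 10e-6         # inductance, henries
DT = 1e-6         # simulation step, seconds

def step(i_l, charge_inductor, fraction_to_load):
    if charge_inductor:               # first current control mechanism
        i_l += (V_SOURCE / L) * DT    # ramp the inductor current
    i_load = fraction_to_load * i_l   # second mechanism: portion sent to the load
    return i_l, i_load

i_l = 0.0
for n in range(5):
    i_l, i_load = step(i_l, charge_inductor=True, fraction_to_load=0.5)
    print(f"step {n}: inductor {i_l:.1f} A, delivered {i_load:.1f} A")
```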
-
Publication No.: US12141225B2
Publication Date: 2024-11-12
Application No.: US16750823
Filing Date: 2020-01-23
Applicant: NVIDIA Corporation
Inventor: William James Dally , Rangharajan Venkatesan , Brucek Kurdo Khailany
Abstract: Neural networks, in many cases, include convolution layers that are configured to perform many convolution operations that require multiplication and addition operations. Compared with performing multiplication on integer, fixed-point, or floating-point format values, performing multiplication on logarithmic format values is straightforward and energy efficient as the exponents are simply added. However, performing addition on logarithmic format values is more complex. Conventionally, addition is performed by converting the logarithmic format values to integers, computing the sum, and then converting the sum back into the logarithmic format. Instead, logarithmic format values may be added by decomposing the exponents into separate quotient and remainder components, sorting the quotient components based on the remainder components, summing the sorted quotient components using an asynchronous accumulator to produce partial sums, and multiplying the partial sums by the remainder components to produce a sum. The sum may then be converted back into the logarithmic format.
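A sketch of the addition scheme described above, assuming each logarithmic value encodes 2**(e/F) with a fixed-point exponent e and F fractional steps (F = 4 here, with exponents assumed non-negative). Binning by remainder stands in for the sorting step; the per-bin quotient sums accumulate as integer shifts before a single scale by each remainder factor.

```python
# Sketch of log-format addition by quotient/remainder decomposition. F and the
# non-negative exponents are assumptions for illustration.
F = 4  # assumed number of remainder (fractional) steps per power of two

def add_log_values(exponents):
    partial = [0] * F                  # one integer partial sum per remainder bin
    for e in exponents:
        q, r = divmod(e, F)            # decompose exponent into quotient, remainder
        partial[r] += 1 << q           # 2**q accumulated as a cheap shift
    # Scale each partial sum once by its remainder factor 2**(r/F) and combine.
    return sum(p * 2 ** (r / F) for r, p in enumerate(partial))

exponents = [6, 3, 9, 1]               # each encodes the value 2**(e/4)
print(add_log_values(exponents))
print(sum(2 ** (e / F) for e in exponents))  # direct sum agrees
```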
-
Publication No.: US12099453B2
Publication Date: 2024-09-24
Application No.: US17709031
Filing Date: 2022-03-30
Applicant: NVIDIA Corporation
Inventor: William James Dally , Carl Thomas Gray , Stephen W. Keckler , James Michael O'Connor
IPC: G06F13/16 , G11C8/12 , H03K19/1776
CPC classification number: G06F13/161 , G06F13/1673 , G06F13/1689 , G11C8/12 , H03K19/1776
Abstract: Embodiments of the present disclosure relate to application partitioning for locality in a stacked memory system. In an embodiment, one or more memory dies are stacked on the processor die. The processor die includes multiple processing tiles and each memory die includes multiple memory tiles. Vertically aligned memory tiles are directly coupled to and comprise the local memory block for a corresponding processing tile. An application program that operates on dense multi-dimensional arrays (matrices) may partition the dense arrays into sub-arrays associated with program tiles. Each program tile is executed by a processing tile using the processing tile's local memory block to process the associated sub-array. Data associated with each sub-array is stored in a local memory block and the processing tile corresponding to the local memory block executes the program tile to process the sub-array data.
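A minimal sketch of the partitioning described above, assuming a 2x2 grid of processing tiles: the dense matrix is split into sub-arrays, and each sub-array is stored in the local memory block of the processing tile that will execute its program tile.

```python
# Illustrative sketch: partition a dense matrix into sub-arrays, one per program
# tile, each held in its processing tile's local memory block. The 2x2 tile grid
# and names are assumptions.
TILE_ROWS, TILE_COLS = 2, 2   # assumed grid of processing tiles on the die

def partition(matrix):
    rows, cols = len(matrix), len(matrix[0])
    rh, cw = rows // TILE_ROWS, cols // TILE_COLS
    local_blocks = {}
    for tr in range(TILE_ROWS):
        for tc in range(TILE_COLS):
            sub = [row[tc * cw:(tc + 1) * cw] for row in matrix[tr * rh:(tr + 1) * rh]]
            local_blocks[(tr, tc)] = sub   # stored in tile (tr, tc)'s local block
    return local_blocks

matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
print(partition(matrix)[(1, 0)])  # sub-array processed by processing tile (1, 0)
```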
-
Publication No.: US12033060B2
Publication Date: 2024-07-09
Application No.: US16750917
Filing Date: 2020-01-23
Applicant: NVIDIA Corporation
Abstract: Neural networks, in many cases, include convolution layers that are configured to perform many convolution operations that require multiplication and addition operations. Compared with performing multiplication on integer, fixed-point, or floating-point format values, performing multiplication on logarithmic format values is straightforward and energy efficient as the exponents are simply added. However, performing addition on logarithmic format values is more complex. Conventionally, addition is performed by converting the logarithmic format values to integers, computing the sum, and then converting the sum back into the logarithmic format. Instead, logarithmic format values may be added by decomposing the exponents into separate quotient and remainder components, sorting the quotient components based on the remainder components, summing the sorted quotient components using an asynchronous accumulator to produce partial sums, and multiplying the partial sums by the remainder components to produce a sum. The sum may then be converted back into the logarithmic format.
-
Publication No.: US20240112007A1
Publication Date: 2024-04-04
Application No.: US18537570
Filing Date: 2023-12-12
Applicant: NVIDIA Corporation
Inventor: William James Dally , Rangharajan Venkatesan , Brucek Kurdo Khailany
CPC classification number: G06N3/063 , G06F7/4833 , G06F17/16
Abstract: Neural networks, in many cases, include convolution layers that are configured to perform many convolution operations that require multiplication and addition operations. Compared with performing multiplication on integer, fixed-point, or floating-point format values, performing multiplication on logarithmic format values is straightforward and energy efficient as the exponents are simply added. However, performing addition on logarithmic format values is more complex. Conventionally, addition is performed by converting the logarithmic format values to integers, computing the sum, and then converting the sum back into the logarithmic format. Instead, logarithmic format values may be added by decomposing the exponents into separate quotient and remainder components, sorting the quotient components based on the remainder components, summing the sorted quotient components to produce partial sums, and multiplying the partial sums by the remainder components to produce a sum. The sum may then be converted back into the logarithmic format.
-