PIPELINED CONVOLUTIONAL OPERATIONS FOR PROCESSING CLUSTERS

    Publication Number: US20170097884A1

    Publication Date: 2017-04-06

    Application Number: US14874784

    Filing Date: 2015-10-05

    CPC classification number: G06F12/023 G06F15/76 G06F2212/251 G06T1/20

    Abstract: Described herein are one or more integrated circuits (ICs) comprising controller circuitry to receive a command to execute an operation for data inputs stored in an external memory or a local memory, and to convert the operation into a set of matrix operations that operate on sub-portions of the data inputs. The IC(s) further comprise at least one processing circuitry to execute the set of matrix operations. The processing circuitry includes ALUs, a local memory external to the ALUs and accessible by the ALUs, and processing control circuitry to create at least one matrix operand in the local memory (from the data inputs of the operation) comprising at least one of a scalar, a vector, or a 2D matrix, and to provide memory handles corresponding to each of the matrix operands to one of the ALUs for accessing the respective matrix operands when executing a matrix operation.
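
    The lowering the abstract describes, turning one convolution command into matrix operations over sub-portions of the input, can be illustrated with the standard im2col technique. The sketch below is a plain-Python illustration of that general idea, not the patented circuitry; the function names are hypothetical.

        import numpy as np

        def im2col(x, kh, kw):
            """Unroll each kh-by-kw patch of a 2-D input into one matrix row."""
            h, w = x.shape
            oh, ow = h - kh + 1, w - kw + 1
            cols = np.empty((oh * ow, kh * kw))
            for i in range(oh):
                for j in range(ow):
                    cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
            return cols, (oh, ow)

        def conv2d_as_matmul(x, k):
            """Valid 2-D convolution (correlation) expressed as one matrix product."""
            cols, (oh, ow) = im2col(x, *k.shape)
            return (cols @ k.ravel()).reshape(oh, ow)

        x = np.arange(25, dtype=float).reshape(5, 5)
        k = np.ones((3, 3)) / 9.0          # 3x3 mean filter
        print(conv2d_as_matmul(x, k))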

    Artificial neural network training using flexible floating point tensors

    Publication Number: US12205035B2

    Publication Date: 2025-01-21

    Application Number: US16004243

    Filing Date: 2018-06-08

    Abstract: The present disclosure is directed to systems and methods for training neural networks using a tensor that includes a plurality of FP16 values and a plurality of bits that define an exponent shared by some or all of the FP16 values included in the tensor. The FP16 values may include IEEE 754 format 16-bit floating point values, and the tensor may include a plurality of bits defining the shared exponent. The tensor may include a shared exponent and FP16 values that include a variable bit-length mantissa and a variable bit-length exponent that may be dynamically set by processor circuitry. Alternatively, the tensor may include a shared exponent and FP16 values that include: a variable bit-length mantissa; a variable bit-length exponent that may be dynamically set by processor circuitry; and a shared exponent switch set by the processor circuitry to selectively combine the FP16 value exponent with the shared exponent.
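
    The shared-exponent layout is essentially block floating point: one exponent is factored out of the whole tensor and each element keeps only an FP16 value. The sketch below shows that encoding with numpy under assumed scaling choices; the actual bit packing and exponent-selection rule in the patent may differ.

        import numpy as np

        def to_shared_exponent(t):
            """Encode a tensor as FP16 values plus one shared exponent.

            The shared exponent is taken from the largest magnitude so
            every scaled element fits comfortably in FP16 (an assumption,
            not the patented selection rule).
            """
            m = float(np.max(np.abs(t)))
            shared = int(np.floor(np.log2(m))) if m > 0 else 0
            return (t / 2.0 ** shared).astype(np.float16), shared

        def from_shared_exponent(fp16_vals, shared):
            """Reconstruct an approximate FP32 tensor from the encoding."""
            return fp16_vals.astype(np.float32) * 2.0 ** shared

        t = np.array([1.5e-3, -2.0e-2, 3.2e-2, 7.0e-4], dtype=np.float32)
        enc, e = to_shared_exponent(t)
        print(e, from_shared_exponent(enc, e))  # small quantization error vs. t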

    ARTIFICIAL NEURAL NETWORK TRAINING USING FLEXIBLE FLOATING POINT TENSORS

    Publication Number: US20240028905A1

    Publication Date: 2024-01-25

    Application Number: US18478554

    Filing Date: 2023-09-29

    CPC classification number: G06N3/084 G06N3/063 G06N3/045 G06F9/3013

    Abstract: The present disclosure is directed to systems and methods for training neural networks using a tensor that includes a plurality of FP16 values and a plurality of bits that define an exponent shared by some or all of the FP16 values included in the tensor. The FP16 values may include IEEE 754 format 16-bit floating point values, and the tensor may include a plurality of bits defining the shared exponent. The tensor may include a shared exponent and FP16 values that include a variable bit-length mantissa and a variable bit-length exponent that may be dynamically set by processor circuitry. Alternatively, the tensor may include a shared exponent and FP16 values that include: a variable bit-length mantissa; a variable bit-length exponent that may be dynamically set by processor circuitry; and a shared exponent switch set by the processor circuitry to selectively combine the FP16 value exponent with the shared exponent.

    Apparatus and method for coherent, accelerated conversion between data representations

    Publication Number: US10761757B2

    Publication Date: 2020-09-01

    Application Number: US16024812

    Filing Date: 2018-06-30

    Abstract: An apparatus and method for converting tensor data. For example, one embodiment of a method comprises: fetching source tensor blocks of a source tensor data structure, each source tensor block comprising a plurality of source tensor data elements having a first numeric representation, wherein the source tensor data structure comprises a predefined structural arrangement of source tensor blocks; converting the one or more source tensor blocks into one or more destination tensor blocks comprising a plurality of destination tensor data elements having a second numeric representation different from the first numeric representation, wherein sets of one or more source tensor blocks are converted to one or more corresponding destination tensor blocks in a specified order based on the first and second numeric representations; and storing each individual destination tensor block in a designated memory region to maintain coherency with the predefined structural arrangement of the source tensor blocks.
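
    The block-wise conversion flow can be modeled in a few lines: walk the source blocks in order, convert each block's elements to the destination representation, and write the result into the region that mirrors the source layout. The block size, dtypes, and function name below are illustrative assumptions, not the claimed hardware.

        import numpy as np

        def convert_tensor_blocks(src, block=64, dst_dtype=np.float16):
            """Convert a tensor block-by-block to a new numeric representation,
            preserving the source's structural arrangement in the destination."""
            flat = src.ravel()
            dst = np.empty(flat.size, dtype=dst_dtype)
            for start in range(0, flat.size, block):      # specified order
                dst[start:start + block] = flat[start:start + block].astype(dst_dtype)
            return dst.reshape(src.shape)

        src = np.random.rand(4, 128).astype(np.float32)   # FP32 -> FP16
        dst = convert_tensor_blocks(src)
        print(dst.dtype, dst.shape, float(np.max(np.abs(src - dst))))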

    Pipelined convolutional operations for processing clusters

    Publication Number: US09886377B2

    Publication Date: 2018-02-06

    Application Number: US14874784

    Filing Date: 2015-10-05

    CPC classification number: G06F12/023 G06F15/76 G06F2212/251 G06T1/20

    Abstract: Described herein are one or more integrated circuits (ICs) comprising controller circuitry to receive a command to execute an operation for data inputs stored in an external memory or a local memory, and to convert the operation into a set of matrix operations that operate on sub-portions of the data inputs. The IC(s) further comprise at least one processing circuitry to execute the set of matrix operations. The processing circuitry includes ALUs, a local memory external to the ALUs and accessible by the ALUs, and processing control circuitry to create at least one matrix operand in the local memory (from the data inputs of the operation) comprising at least one of a scalar, a vector, or a 2D matrix, and to provide memory handles corresponding to each of the matrix operands to one of the ALUs for accessing the respective matrix operands when executing a matrix operation.

    Apparatuses and methods to accelerate matrix multiplication

    Publication Number: US12254061B2

    Publication Date: 2025-03-18

    Application Number: US17256195

    Filing Date: 2018-09-27

    Abstract: Methods and apparatuses relating to performing vector multiplication are described. Hardware accelerators to perform vector multiplication are also described. In one embodiment, a combined fixed-point and floating-point vector multiplication circuit includes at least one switch to change the circuit between a first mode and a second mode. In the first mode, each multiplier of a set of multipliers is to multiply mantissas from a same element position of a first floating-point vector and a second floating-point vector to produce a corresponding product, shift the corresponding products with a set of shift registers based on a maximum exponent of exponents for the corresponding products determined by a maximum exponent determiner to produce shifted products, perform a numeric conversion operation on the shifted products with a set of numeric conversion circuits based on sign bits from the same element position of the first floating-point vector and the second floating-point vector to produce signed representations of the shifted products, add the signed representations of the shifted products with a set of adders to produce a single product, and normalize the single product with a normalization circuit based on the maximum exponent into a single floating-point resultant. In the second mode, each multiplier of the set of multipliers is to multiply values from a same element position of a first integer vector and a second integer vector to produce a corresponding product, and add each corresponding product with the set of adders to produce a single integer resultant.
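
    The first (floating-point) mode amounts to a dot product computed by multiplying mantissas, aligning every product to the maximum product exponent, summing with one adder chain, and normalizing once at the end. The Python below is a behavioral model of that dataflow using math.frexp; signs ride along in the mantissas here rather than through dedicated sign-conversion circuits, and the function names are assumptions.

        import math

        def float_dot_max_exp(a, b):
            """Dot product via mantissa multiply, max-exponent alignment,
            accumulate, then a single normalization (behavioral sketch)."""
            # frexp splits each product p into (m, e) with p == m * 2**e.
            parts = [math.frexp(x * y) for x, y in zip(a, b)]
            max_e = max(e for _, e in parts)
            # Align: scale each mantissa down by its distance from max_e,
            # mimicking the shift registers, then add everything once.
            acc = sum(m * 2.0 ** (e - max_e) for m, e in parts)
            return acc * 2.0 ** max_e          # normalize by max exponent

        def int_dot(a, b):
            """Second mode: plain integer multiply-accumulate on the same adders."""
            return sum(x * y for x, y in zip(a, b))

        a, b = [1.5, -2.25, 0.875], [4.0, 0.5, -8.0]
        print(float_dot_max_exp(a, b))         # -2.125, matches sum(x*y)
        print(int_dot([1, 2, 3], [4, 5, 6]))   # 32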

    Software assisted power management

    Publication Number: US11567555B2

    Publication Date: 2023-01-31

    Application Number: US16557657

    Filing Date: 2019-08-30

    Abstract: Embodiments include an apparatus comprising an execution unit coupled to a memory, a microcode controller, and a hardware controller. The microcode controller is to identify a global power and performance hint in an instruction stream that includes first and second instruction phases to be executed in parallel, identify a first local hint based on synchronization dependence in the first instruction phase, and use the first local hint to balance power consumption between the execution unit and the memory during parallel executions of the first and second instruction phases. The hardware controller is to use the global hint to determine an appropriate voltage level of a compute voltage and a frequency of a compute clock signal for the execution unit during the parallel executions of the first and second instruction phases. The first local hint includes a processing rate for the first instruction phase or an indication of that processing rate.
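
    The split of roles can be modeled abstractly: the hardware controller maps the global hint to a voltage/frequency operating point for the whole parallel region, while the microcode controller derives a per-phase processing rate from the local hint. Everything below (the operating-point table, hint names, and rate formula) is invented purely for illustration.

        # Global hint -> (compute voltage in volts, compute clock in GHz).
        OPERATING_POINTS = {
            "memory_bound":  (0.70, 1.2),   # slow compute, leave power for memory
            "balanced":      (0.85, 1.8),
            "compute_bound": (1.00, 2.4),   # boost the compute clock
        }

        def apply_global_hint(hint):
            """Hardware-controller role: pick V/f for the parallel region."""
            return OPERATING_POINTS[hint]

        def apply_local_hint(issue_rate, dependency_stalls):
            """Microcode-controller role: scale back a phase's processing rate
            when synchronization dependences would leave the unit waiting."""
            return issue_rate / (1 + dependency_stalls)

        v, f = apply_global_hint("memory_bound")
        rate = apply_local_hint(issue_rate=4.0, dependency_stalls=3)
        print(f"V={v} V, f={f} GHz, phase issue rate={rate} ops/cycle")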
