-
Publication number: US20180188972A1
Publication date: 2018-07-05
Application number: US15395427
Filing date: 2016-12-30
Applicant: Intel Corporation
Inventor: Andrew Yang , Carey K. Kloss , Tony L. Werner , Horace Lau
CPC classification number: G06F3/0611 , G06F3/0646 , G06F3/0658 , G06F3/0673 , G06F11/1048 , G06F17/16 , G11C7/10 , G11C7/1006
Abstract: In one embodiment, an apparatus comprises a memory and a memory controller. The memory comprises a plurality of memory modules, wherein each memory module comprises a plurality of storage locations. The memory controller may be configured to write data of a matrix to the memory. For example, the memory controller may be configured to write a particular row or a particular column of the matrix to the memory by: shifting a plurality of matrix elements of the particular row or the particular column; and writing the plurality of matrix elements to the plurality of memory modules.
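The shift-then-write scheme in the abstract resembles classic skewed (rotated) matrix storage, where rotating each row before writing lets both a full row and a full column be read with one access per memory module. The sketch below is an illustrative reconstruction of that general technique, not the patented implementation; all names are invented.

```python
# Skewed storage of an N x N matrix across N memory "modules", so that
# every row AND every column touches each module exactly once.
# Element (r, c) ends up in module (c + r) % N at location r.

N = 4
# each module is a list of N storage locations
modules = [[None] * N for _ in range(N)]

def write_matrix(matrix):
    for r, row in enumerate(matrix):
        shifted = row[-r:] + row[:-r] if r else row[:]  # rotate row r by r
        for m, value in enumerate(shifted):
            modules[m][r] = value                       # one element per module

def read_row(r):
    shifted = [modules[m][r] for m in range(N)]
    return shifted[r:] + shifted[:r]                    # undo the rotation

def read_col(c):
    # column c is spread across modules: one location per module, no conflicts
    return [modules[(c + r) % N][r] for r in range(N)]

A = [[r * N + c for c in range(N)] for r in range(N)]
write_matrix(A)
```

Because the rotation offset differs per row, a column read hits a different module at every row index, which is what makes row and column access equally parallel.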
-
Publication number: US20170097884A1
Publication date: 2017-04-06
Application number: US14874784
Filing date: 2015-10-05
Applicant: Intel Corporation
Inventor: Tony Werner , Aravind Kalaiah , Andrew Yang , Carey Kloss , Horace Lau , Naveen Gandham Rao , Amir Khosrowshahi
CPC classification number: G06F12/023 , G06F15/76 , G06F2212/251 , G06T1/20
Abstract: Described herein are one or more integrated circuits (ICs) comprising controller circuitry to receive a command to execute an operation for data inputs stored in an external memory or a local memory, and convert the operation into a set of matrix operations to operate on sub-portions of the data inputs. The IC(s) further comprise at least one processing circuitry to execute the set of matrix operations, the processing circuitry to include ALUs, a local memory external to the ALUs and accessible by the ALUs, and processing control circuitry to create at least one matrix operand in the local memory (from the data inputs of the operation) comprising at least one of a scalar, a vector, or a 2D matrix, and provide memory handles corresponding to each of the matrix operands to one of the ALUs to access the respective matrix operands when executing a matrix operation.
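Converting one operation into a set of matrix operations over sub-portions of the inputs is, in essence, tiling. A minimal sketch of that decomposition for a matrix multiply follows; it is a generic blocked algorithm, not Intel's controller logic, and the function name is invented.

```python
# Decompose one large matrix multiply into many tile-sized matrix
# operations, each touching only a sub-portion of the inputs -- the kind
# of operand the abstract's controller hands to an ALU.

def tiled_matmul(a, b, tile=2):
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # one "matrix operation" on one tile of a and one tile of b
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c
```

Each innermost triple loop is independent enough that a hardware scheduler can issue the tiles to separate ALUs, with the local memory holding one tile-sized operand per handle.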
-
Publication number: US12205035B2
Publication date: 2025-01-21
Application number: US16004243
Filing date: 2018-06-08
Applicant: Intel Corporation
Inventor: Krishnakumar Nair , Andrew Yang , Brian Morris
Abstract: The present disclosure is directed to systems and methods for training neural networks using a tensor that includes a plurality of FP16 values and a plurality of bits that define an exponent shared by some or all of the FP16 values included in the tensor. The FP16 values may include IEEE 754 format 16-bit floating point values and the tensor may include a plurality of bits defining the shared exponent. The tensor may include a shared exponent and FP16 values that include a variable bit-length mantissa and a variable bit-length exponent that may be dynamically set by processor circuitry. The tensor may include a shared exponent and FP16 values that include a variable bit-length mantissa; a variable bit-length exponent that may be dynamically set by processor circuitry; and a shared exponent switch set by the processor circuitry to selectively combine the FP16 value exponent with the shared exponent.
-
Publication number: US20240028905A1
Publication date: 2024-01-25
Application number: US18478554
Filing date: 2023-09-29
Applicant: Intel Corporation
Inventor: Krishnakumar Nair , Andrew Yang , Brian Morris
CPC classification number: G06N3/084 , G06N3/063 , G06N3/045 , G06F9/3013
Abstract: The present disclosure is directed to systems and methods for training neural networks using a tensor that includes a plurality of FP16 values and a plurality of bits that define an exponent shared by some or all of the FP16 values included in the tensor. The FP16 values may include IEEE 754 format 16-bit floating point values and the tensor may include a plurality of bits defining the shared exponent. The tensor may include a shared exponent and FP16 values that include a variable bit-length mantissa and a variable bit-length exponent that may be dynamically set by processor circuitry. The tensor may include a shared exponent and FP16 values that include a variable bit-length mantissa; a variable bit-length exponent that may be dynamically set by processor circuitry; and a shared exponent switch set by the processor circuitry to selectively combine the FP16 value exponent with the shared exponent.
-
Publication number: US10761757B2
Publication date: 2020-09-01
Application number: US16024812
Filing date: 2018-06-30
Applicant: Intel Corporation
Inventor: Krishnakumar Nair , Andrew Yang , Michael Rotzin , Nitin Garegrat , Tom Schebye , Tony Werner
Abstract: An apparatus and method for converting tensor data. For example, one embodiment of a method comprises: fetching source tensor blocks of a source tensor data structure, each source tensor block comprising a plurality of source tensor data elements having a first numeric representation, wherein the source tensor data structure comprises a predefined structural arrangement of source tensor blocks; converting the one or more source tensor blocks into one or more destination tensor blocks comprising a plurality of destination tensor data elements having a second numeric representation different from the first numeric representation, wherein the sets of one or more source tensor blocks are converted to one or more corresponding destination tensor blocks in a specified order based on the first and second numeric representations; and storing each individual destination tensor block in a designated memory region to maintain coherency with the predefined structural arrangement of the source tensor blocks.
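One concrete instance of such a numeric-representation change is float32 to bfloat16, which can be emulated by truncating the low 16 bits of the float32 pattern. The sketch below is a software illustration of block-wise conversion, not the patented hardware path; all function names are invented.

```python
# Convert tensor blocks element-by-element from float32 to an emulated
# bfloat16 (top 16 bits of the float32 pattern: sign, 8 exponent bits,
# 7 mantissa bits), keeping each destination block at the position that
# mirrors the source block layout.
import struct

def f32_to_bf16_bits(x):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16                      # truncate low 16 mantissa bits

def bf16_bits_to_f32(b):
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

def convert_blocks(src_blocks):
    # destination block i corresponds to source block i (coherent layout)
    return [[f32_to_bf16_bits(v) for v in blk] for blk in src_blocks]
```

Values whose mantissas fit in 7 bits (such as 1.5 or 2.0) round-trip exactly; everything else is truncated toward zero in the mantissa.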
-
Publication number: US09886377B2
Publication date: 2018-02-06
Application number: US14874784
Filing date: 2015-10-05
Applicant: Intel Corporation
Inventor: Tony Werner , Aravind Kalaiah , Andrew Yang , Carey Kloss , Horace Lau , Naveen Gandham Rao , Amir Khosrowshahi
CPC classification number: G06F12/023 , G06F15/76 , G06F2212/251 , G06T1/20
Abstract: Described herein are one or more integrated circuits (ICs) comprising controller circuitry to receive a command to execute an operation for data inputs stored in an external memory or a local memory, and convert the operation into a set of matrix operations to operate on sub-portions of the data inputs. The IC(s) further comprise at least one processing circuitry to execute the set of matrix operations, the processing circuitry to include ALUs, a local memory external to the ALUs and accessible by the ALUs, and processing control circuitry to create at least one matrix operand in the local memory (from the data inputs of the operation) comprising at least one of a scalar, a vector, or a 2D matrix, and provide memory handles corresponding to each of the matrix operands to one of the ALUs to access the respective matrix operands when executing a matrix operation.
-
Publication number: US12254061B2
Publication date: 2025-03-18
Application number: US17256195
Filing date: 2018-09-27
Applicant: Intel Corporation
Inventor: Maciej Urbanski , Brian J. Hickmann , Michael Rotzin , Krishnakumar Nair , Andrew Yang , Brian S. Morris , Dennis Bradford
Abstract: Methods and apparatuses relating to performing vector multiplication are described. Hardware accelerators to perform vector multiplication are also described. In one embodiment, a combined fixed-point and floating-point vector multiplication circuit includes at least one switch to change the circuit between a first mode and a second mode, where in the first mode, each multiplier of a set of multipliers is to multiply mantissas from a same element position of a first floating-point vector and a second floating-point vector to produce a corresponding product, shift the corresponding products with a set of shift registers based on a maximum exponent of exponents for the corresponding products determined by a maximum exponent determiner to produce shifted products, perform an numeric conversion operation on the shifted products with a set of numeric conversion circuits based on sign bits from the same element position of the first floating-point vector and the second floating-point vector to produce signed representations of the shifted products, add the signed representations of the shifted products with a set of adders to produce a single product, and normalize the single product with a normalization circuit based on the maximum exponent into a single floating-point resultant, and in the second mode, each multiplier of the set of multipliers is to multiply values from a same element position of a first integer vector and a second integer vector to produce a corresponding product, and add each corresponding product with the set of adders to produce a single integer resultant.
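The floating-point mode described above (multiply mantissas, align every product to the maximum exponent with right shifts, add signed values, normalize once) can be mimicked in integer arithmetic. This is an illustrative model of the general technique, not the patented circuit; the function name and the 10-bit mantissa width are assumptions.

```python
# Floating-point dot product done the hardware way: fixed-point mantissa
# products aligned to the maximum product exponent, one signed add, one
# normalization at the end.
import math

def fp_dot(xs, ys, mant_bits=10):
    prods = []
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(abs(x))        # mantissa in [0.5, 1), exponent
        my, ey = math.frexp(abs(y))
        mant = int(mx * (1 << mant_bits)) * int(my * (1 << mant_bits))
        sign = -1 if (x < 0) != (y < 0) else 1
        prods.append((sign * mant, ex + ey))
    max_e = max(e for _, e in prods)
    # right-shift every product down to the shared (maximum) exponent
    acc = sum(m >> (max_e - e) for m, e in prods)
    # single normalization back to a real-valued result
    return acc / float(1 << (2 * mant_bits)) * 2.0 ** max_e
```

Aligning to the maximum exponent means only right shifts are ever needed, which is why the circuit can use plain shift registers before the adder tree; the integer mode in the abstract simply bypasses the shift/normalize stages.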
-
Publication number: US20240112006A1
Publication date: 2024-04-04
Application number: US18534566
Filing date: 2023-12-08
Applicant: Intel Corporation
Inventor: Horace H. Lau , Prashant Arora , Olivia K. Wu , Tony L. Werner , Carey K. Kloss , Amir Khosrowshahi , Andrew Yang , Aravind Kalaiah , Vijay Anand R. Korthikanti
Abstract: A network of matrix processing units (MPUs) is provided on a device, where each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations. Computer memory stores tensor data and a master control central processing unit (MCC) is provided on the device to receive an instruction from a host device, where the instruction includes one or more tensor operands based on the tensor data. The MCC invokes a set of operations on one or more of the MPUs based on the instruction, where the set of operations includes operations on the tensor operands. A result is generated from the set of operations, the result embodied as a tensor value.
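The control flow the abstract describes (one host instruction with tensor operands, fanned out by a master controller across matrix units, producing a tensor result) can be sketched as a toy dispatcher. All names here are invented and the work split is a simplification; the real device routes work over an MPU network rather than a Python loop.

```python
# Toy "MCC" that receives one matmul instruction and fans the rows of the
# first operand out across a pool of "MPUs", each of which only knows how
# to multiply matrices. The combined result is itself a tensor value.

def mpu_matmul(a_rows, b):
    # one MPU: plain matrix multiply of a row-slice of A against all of B
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a_rows]

def mcc_execute(instruction, num_mpus=2):
    op, a, b = instruction
    assert op == "matmul"
    chunk = (len(a) + num_mpus - 1) // num_mpus   # rows per MPU, rounded up
    result = []
    for i in range(num_mpus):                     # invoke one operation per MPU
        rows = a[i * chunk:(i + 1) * chunk]
        if rows:
            result.extend(mpu_matmul(rows, b))
    return result
```

Splitting by rows keeps each MPU's operand self-contained, so the controller only has to concatenate partial results in order.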
-
Publication number: US11567555B2
Publication date: 2023-01-31
Application number: US16557657
Filing date: 2019-08-30
Applicant: Intel Corporation
Inventor: Jason Seung-Min Kim , Sundar Ramani , Yogesh Bansal , Nitin N. Garegrat , Olivia K. Wu , Mayank Kaushik , Mrinal Iyer , Tom Schebye , Andrew Yang
Abstract: Embodiments include an apparatus comprising an execution unit coupled to a memory, a microcode controller, and a hardware controller. The microcode controller is to identify a global power and performance hint in an instruction stream that includes first and second instruction phases to be executed in parallel, identify a local hint based on synchronization dependence in the first instruction phase, and use the first local hint to balance power consumption between the execution unit and the memory during parallel executions of the first and second instruction phases. The hardware controller is to use the global hint to determine an appropriate voltage level of a compute voltage and a frequency of a compute clock signal for the execution unit during the parallel executions of the first and second instruction phases. The first local hint includes a processing rate for the first instruction phase or an indication of the processing rate.
-
Publication number: US20190392297A1
Publication date: 2019-12-26
Application number: US16474029
Filing date: 2017-12-28
Applicant: Intel Corporation
Inventor: Horace H. Lau , Prashant Arora , Olivia K. Wu , Tony Werner , Carey K. Kloss , Amir Khosrowshahi , Andrew Yang , Aravind Kalaiah , Vijay Anand R. Korthikanti
Abstract: A network of matrix processing units (MPUs) is provided on a device, where each MPU is connected to at least one other MPU in the network, and each MPU is to perform matrix multiplication operations. Computer memory stores tensor data and a master control central processing unit (MCC) is provided on the device to receive an instruction from a host device, where the instruction includes one or more tensor operands based on the tensor data. The MCC invokes a set of operations on one or more of the MPUs based on the instruction, where the set of operations includes operations on the tensor operands. A result is generated from the set of operations, the result embodied as a tensor value.