-
Publication No.: US20220214877A1
Publication Date: 2022-07-07
Application No.: US17704690
Filing Date: 2022-03-25
Applicant: Intel Corporation
Inventor: Dipankar DAS , Naveen K. MELLEMPUDI , Mrinmay DUTTA , Arun KUMAR , Dheevatsa MUDIGERE , Abhisek KUNDU
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
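The asymmetric FMA described above can be modeled in software. The sketch below is illustrative only (not Intel's implementation); it assumes a 32-bit SIMD lane, 8-bit first-source elements, and 2-bit second-source elements, all picked from the ranges the abstract lists.

```python
# Illustrative sketch of one SIMD lane of the asymmetric FMA instruction.
# Assumed parameters (not from the patent text): 32-bit lane, 8-bit
# first-source elements, 2-bit second-source elements.

LANE_WIDTH = 32      # SIMD lane width in bits (one of 16, 32, 64)
SRC1_WIDTH = 8       # first-source element width (one of 4, 8)
SRC2_WIDTH = 2       # second-source element width (one of 1, 2, 4)

def asymmetric_fma_lane(src1, src2, dest):
    """Multiply each second-source element by the corresponding
    first-source element and accumulate the products into dest."""
    # Process as many second-source elements as fit in one lane.
    n = LANE_WIDTH // SRC2_WIDTH
    for i in range(n):
        dest += src1[i] * src2[i]
    return dest

# Example: 16 two-bit values times 16 eight-bit values, accumulated.
a = [3] * 16          # second-source elements (2-bit range: 0..3)
b = [10] * 16         # first-source elements (8-bit range: 0..255)
print(asymmetric_fma_lane(b, a, 0))   # 16 * 30 = 480
```

The asymmetry is that the two source vectors carry elements of different widths, so a single lane holds more second-source elements than first-source elements of the same total bit count.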
-
Publication No.: US20200257527A1
Publication Date: 2020-08-13
Application No.: US16735381
Filing Date: 2020-01-06
Applicant: Intel Corporation
Inventor: Dipankar DAS , Naveen K. MELLEMPUDI , Mrinmay DUTTA , Arun KUMAR , Dheevatsa MUDIGERE , Abhisek KUNDU
Abstract: Disclosed embodiments relate to instructions for fused multiply-add (FMA) operations with variable-precision inputs. In one example, a processor to execute an asymmetric FMA instruction includes fetch circuitry to fetch an FMA instruction having fields to specify an opcode, a destination, and first and second source vectors having first and second widths, respectively, decode circuitry to decode the fetched FMA instruction, and a single instruction multiple data (SIMD) execution circuit to process as many elements of the second source vector as fit into an SIMD lane width by multiplying each element by a corresponding element of the first source vector, and accumulating a resulting product with previous contents of the destination, wherein the SIMD lane width is one of 16 bits, 32 bits, and 64 bits, the first width is one of 4 bits and 8 bits, and the second width is one of 1 bit, 2 bits, and 4 bits.
-
Publication No.: US20190243651A1
Publication Date: 2019-08-08
Application No.: US16317501
Filing Date: 2016-09-27
Applicant: Intel Corporation
Inventor: Swagath VENKATARAMANI , Dipankar DAS , Sasikanth AVANCHA , Ashish RANJAN , Subarno BANERJEE , Bharat KAUL , Anand RAGHUNATHAN
IPC: G06F9/30
CPC classification number: G06F9/30145 , G06F9/3004 , G06F9/30043 , G06F9/30087 , G06F9/3834 , G06F9/52 , G06N3/063 , G06N3/084
Abstract: Systems, methods, and apparatuses relating to access synchronization in a shared memory are described. In one embodiment, a processor includes a decoder to decode an instruction into a decoded instruction, and an execution unit to execute the decoded instruction to: receive a first input operand of a memory address to be tracked and a second input operand of an allowed sequence of memory accesses to the memory address, and cause a block of a memory access that violates the allowed sequence of memory accesses to the memory address. In one embodiment, a circuit separate from the execution unit compares a memory address for a memory access request to one or more memory addresses in a tracking table, and blocks a memory access for the memory access request when a type of access violates a corresponding allowed sequence of memory accesses to the memory address for the memory access request.
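The tracking-table behavior described in this abstract can be sketched as a small software model. This is illustrative only, not the patented circuit; the address, sequence encoding (`'W'`/`'R'`), and cyclic-sequence interpretation are assumptions for the example.

```python
# Illustrative software model of the memory-access tracking table:
# an address is registered with an allowed sequence of access types,
# and an access that deviates from the sequence is blocked.

class AccessTracker:
    def __init__(self):
        self.table = {}   # address -> [allowed sequence, next index]

    def track(self, addr, allowed_sequence):
        """Register an address with its allowed sequence, e.g. ['W', 'R']."""
        self.table[addr] = [list(allowed_sequence), 0]

    def access(self, addr, kind):
        """Return True if the access proceeds, False if it is blocked."""
        if addr not in self.table:
            return True                  # untracked addresses pass through
        seq, idx = self.table[addr]
        if seq[idx % len(seq)] != kind:
            return False                 # violates the allowed sequence: block
        self.table[addr][1] = idx + 1    # advance to the next expected access
        return True

t = AccessTracker()
t.track(0x1000, ['W', 'R'])      # writes must alternate with reads
print(t.access(0x1000, 'W'))     # True: first write is allowed
print(t.access(0x1000, 'W'))     # False: a second write before a read is blocked
```

In the patent this check is performed by a circuit separate from the execution unit, comparing each memory access request against the tracked addresses.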
-
Publication No.: US20180314940A1
Publication Date: 2018-11-01
Application No.: US15869515
Filing Date: 2018-01-12
Applicant: Intel Corporation
Inventor: Abhisek KUNDU , Naveen MELLEMPUDI , Dheevatsa MUDIGERE , Dipankar DAS
Abstract: One embodiment provides for a computing device comprising a parallel processor compute unit to perform a set of parallel integer compute operations; a ternarization unit including a weight ternarization circuit and an activation quantization circuit; wherein the weight ternarization circuit is to convert a weight tensor from a floating-point representation to a ternary representation including a ternary weight and a scale factor; wherein the activation quantization circuit is to convert an activation tensor from a floating-point representation to an integer representation; and wherein the parallel processor compute unit includes one or more circuits to perform the set of parallel integer compute operations on the ternary representation of the weight tensor and the integer representation of the activation tensor.
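The weight-ternarization and activation-quantization pipeline described here can be sketched in a few lines. This is an illustrative model, not the patented hardware; the 0.7 × mean-absolute-value threshold rule follows common ternary-weight practice and is an assumption, as is the int8 activation range.

```python
# Illustrative sketch: ternarize float weights to {-1, 0, +1} plus a scale
# factor, quantize activations to int8, then run the integer dot product.
# The threshold rule (0.7 * mean |w|) is an assumed heuristic, not from
# the patent text.

def ternarize(weights):
    """Return (ternary weights, scale factor) for a list of floats."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = 0.7 * mean_abs
    ternary = [0 if abs(w) < threshold else (1 if w > 0 else -1)
               for w in weights]
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return ternary, scale

def quantize(activations, scale=127.0):
    """Map activations in [-1, 1] to int8 values."""
    return [int(round(a * scale)) for a in activations]

w = [0.9, -0.8, 0.05, 0.7]
a = [0.5, 0.25, -1.0, 0.125]
tw, s = ternarize(w)          # ternary weights and their scale factor
qa = quantize(a)              # int8 activations
# Parallel integer compute: dot product on ternary weights and int8
# activations, rescaled back to floating point at the end.
acc = sum(t * q for t, q in zip(tw, qa))
result = acc * s / 127.0
```

The point of the ternary representation is that the multiply in the inner loop degenerates to add, subtract, or skip, which maps well onto the parallel integer compute units the abstract describes.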
-