-
Publication No.: US20170329605A1
Publication date: 2017-11-16
Application No.: US15668508
Filing date: 2017-08-03
Applicant: Intel Corporation
Inventors: ELMOUSTAPHA OULD-AHMED-VALL, ROBERT VALENTINE, JESUS CORBAL SAN ADRIAN, BRET L. TOLL, MARK J. CHARNEY, ZEEV SPERBER, AMIT GRADSTEIN
CPC classification: G06F9/30181, G06F9/30018, G06F9/30032, G06F9/30036, G06F9/3013, G06F9/30167, G06F9/3802, G06F12/0615
Abstract: An apparatus is described having instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction insert a first group of input vector elements into one of multiple first non-overlapping sections of respective first and second resultant vectors. The first group has a first bit width. Each of the multiple first non-overlapping sections has the same bit width as the first group. Both the third instruction and the fourth instruction insert a second group of input vector elements into one of multiple second non-overlapping sections of respective third and fourth resultant vectors. The second group has a second bit width that is larger than the first bit width. Each of the multiple second non-overlapping sections has the same bit width as the second group. The apparatus also includes masking layer circuitry to mask the first and third instructions at a first resultant vector granularity, and to mask the second and fourth instructions at a second resultant vector granularity.
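The insert-with-masking behavior in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `insert_group`, the list-of-elements representation, the one-mask-bit-per-element encoding, and the merging/zeroing choice for masked-off elements are all hypothetical and are not taken from the patent.

```python
def insert_group(dest, group, section, mask=None, zeroing=False):
    """Insert `group` into the `section`-th non-overlapping slot of `dest`.

    Hypothetical model: vectors are lists of elements, `mask` is an integer
    with one bit per element, and masked-off elements either keep the old
    `dest` value (merging) or become zero (zeroing).
    """
    width = len(group)
    if len(dest) % width != 0:
        raise ValueError("dest must hold a whole number of sections")
    if not 0 <= section < len(dest) // width:
        raise IndexError("section out of range")

    # Unmasked result: the chosen section is overwritten by the group.
    result = list(dest)
    result[section * width:(section + 1) * width] = group
    if mask is None:
        return result

    # Apply the mask at element granularity (assumed semantics).
    return [
        new if (mask >> i) & 1 else (0 if zeroing else old)
        for i, (new, old) in enumerate(zip(result, dest))
    ]
```

In this model the mask granularity is simply the element size of the list, so the same function called with a vector of wider elements (and therefore fewer mask bits) stands in for the coarser of the two granularities the abstract mentions.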
-
Publication No.: US20160188532A1
Publication date: 2016-06-30
Application No.: US14583636
Filing date: 2014-12-27
Applicant: INTEL CORPORATION
Inventors: ELMOUSTAPHA OULD-AHMED-VALL, JESUS CORBAL SAN ADRIAN, ROBERT VALENTINE, MARK J. CHARNEY, GUILLEM SOLE, ROGER ESPASA
CPC classification: G06F15/8084, G06F9/30018, G06F9/30032, G06F9/30036, G06F9/3012, G06F15/8076
Abstract: An apparatus and method for performing a vector bit shuffle. For example, one embodiment of a processor comprises: a first vector register to store a plurality of source data elements; a second vector register to store a plurality of control elements, each of the control elements comprising a plurality of bit fields, each bit field to be associated with a corresponding bit position in a destination mask register and to identify a bit from each of the source data elements to be copied to each of the particular bit positions; and vector bit shuffle logic to read each bit field from the second vector register to identify a bit from each of the source data elements and to responsively copy the bit from each of the source data elements to each of the corresponding bit positions in the destination mask register.
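A software model may make the bit-shuffle data flow easier to follow. The sketch below assumes 64-bit source elements and control elements holding eight 8-bit fields each; those widths, the modulo wrap of the bit index, and the name `bit_shuffle_to_mask` are illustrative assumptions, since the abstract does not fix them.

```python
def bit_shuffle_to_mask(src_elems, ctrl_elems, elem_bits=64,
                        field_bits=8, fields_per_elem=8):
    """Build a destination mask by picking one source bit per control field.

    Assumed layout: control element j pairs with source element j, and its
    i-th field selects the source bit written to mask bit j*fields_per_elem+i.
    """
    field_mask = (1 << field_bits) - 1
    mask = 0
    for j, (src, ctrl) in enumerate(zip(src_elems, ctrl_elems)):
        for i in range(fields_per_elem):
            field = (ctrl >> (i * field_bits)) & field_mask
            bit_index = field % elem_bits          # assumed wrap-around
            bit = (src >> bit_index) & 1
            mask |= bit << (j * fields_per_elem + i)
    return mask
```

For example, with a single source element `0b1010` and control fields 1, 0, 3, 2 in the four lowest field slots, the low four mask bits are source bits 1, 0, 3 and 2 in that order.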
-
Publication No.: US20160188335A1
Publication date: 2016-06-30
Application No.: US14583639
Filing date: 2014-12-27
Applicant: INTEL CORPORATION
Inventors: ELMOUSTAPHA OULD-AHMED-VALL, ROBERT VALENTINE, JESUS CORBAL SAN ADRIAN, MARK J. CHARNEY, GUILLEM SOLE, ROGER ESPASA
IPC classification: G06F9/30
CPC classification: G06F9/30036, G06F9/30018, G06F9/30032, G06F9/30098
Abstract: An apparatus and method for performing a vector bit gather. For example, one embodiment of a processor comprises: a first vector register to store one or more source data elements; a second vector register to store one or more control elements, each of the control elements comprising a plurality of bit fields, each bit field to be associated with a corresponding bit position in a destination vector register and to identify a bit from the one or more source data elements to be copied to each of the particular bit positions; and vector bit gather logic to read each bit field from the second vector register to identify a bit from the one or more source data elements and to responsively copy the bit from each of the one or more source data elements to each of the corresponding bit positions in the destination vector register.
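The gather variant writes the selected bits into a destination vector element rather than a mask register. A minimal sketch for one source element follows; it assumes the control bit fields have already been decoded into a list of bit indices, and both the name `bit_gather` and the 64-bit source width are hypothetical.

```python
def bit_gather(src, fields, src_bits=64):
    """Return a destination element whose bit `pos` is source bit fields[pos].

    `fields` is the decoded list of bit indices from one control element
    (an assumed representation; the patent stores them as packed bit fields).
    """
    dest = 0
    for pos, index in enumerate(fields):
        bit = (src >> (index % src_bits)) & 1
        dest |= bit << pos
    return dest
```

Passing `[3, 2, 1, 0]` as the fields reverses the low four bits of the source, which shows how an arbitrary bit permutation can be expressed through the control element.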
-
Publication No.: US20160179548A1
Publication date: 2016-06-23
Application No.: US14580055
Filing date: 2014-12-22
Applicant: Intel Corporation
CPC classification: G06F9/30185, G06F7/76, G06F7/764, G06F9/30018, G06F9/30032
Abstract: In one embodiment a processing device implements a set of instructions to perform an inverse centrifuge operation using vector or general purpose registers. The inverse centrifuge operation interleaves bits from opposite regions of a source and writes the interleaved bits to a destination. The instructions use a control mask: each bit whose mask value is one is obtained from one side of the source register or vector element, and each bit whose mask value is zero is obtained from the opposing side.
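The interleaving described here can be sketched in software. The version below is an illustration under explicit assumptions: mask-zero destination bits are drawn in order from the low region of the source, mask-one bits are drawn in order from the high region, and the high region starts right after as many low bits as there are zeros in the mask. The abstract does not say which side corresponds to which mask value, so that choice, the 8-bit default width, and the name `inverse_centrifuge` are hypothetical.

```python
def inverse_centrifuge(src, mask, width=8):
    """Interleave bits from the two ends of `src` under control of `mask`."""
    mask &= (1 << width) - 1
    zeros = width - bin(mask).count("1")
    lo = 0        # next unread bit in the low region (feeds mask-zero bits)
    hi = zeros    # next unread bit in the high region (feeds mask-one bits)
    dest = 0
    for pos in range(width):
        if (mask >> pos) & 1:
            bit = (src >> hi) & 1
            hi += 1
        else:
            bit = (src >> lo) & 1
            lo += 1
        dest |= bit << pos
    return dest
```

With `src = 0b11110000` and `mask = 0b10101010`, the four high ones land on the four mask-one positions, so the destination equals the mask itself.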
-
Publication No.: US20160179528A1
Publication date: 2016-06-23
Application No.: US14581607
Filing date: 2014-12-23
Applicant: INTEL CORPORATION
IPC classification: G06F9/30
CPC classification: G06F9/30036, G06F9/30018, G06F9/30021, G06F9/30047, G06F9/30112, G06F9/3834, G06F9/3838
Abstract: An apparatus and method are described for performing conflict detection operations. For example, one embodiment of a processor comprises: a first source vector register to store a first set of data elements; a second source vector register to store a second set of data elements; conflict detection logic to perform a specified comparison operation comparing each of the first set of data elements with specified data elements from the second set and generating a set of comparison results, the comparison operation to be selected from a group consisting of a greater than comparison, a less than comparison, a greater than or equal to comparison, a less than or equal to comparison, and a not equal to comparison.
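As a rough illustration of conflict detection with a selectable comparison, the sketch below compares element i of the first set against elements 0 to i-1 of the second set and records each hit as a bit in a per-element result. Which elements count as the "specified data elements" is not stated in the abstract; the preceding-elements rule, the string keys, and the name `conflict_compare` are assumptions made for this example.

```python
import operator

# The five comparisons listed in the abstract, keyed by hypothetical names.
COMPARISONS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "ge": operator.ge,
    "le": operator.le,
    "ne": operator.ne,
}


def conflict_compare(first, second, op):
    """Return one bitmask per element of `first`.

    Bit j of result i is set when cmp(first[i], second[j]) holds for j < i
    (assumed selection of the elements to compare against).
    """
    cmp = COMPARISONS[op]
    results = []
    for i, a in enumerate(first):
        bits = 0
        for j in range(i):
            if cmp(a, second[j]):
                bits |= 1 << j
        results.append(bits)
    return results
```

Calling it with the same list as both inputs gives a per-element summary of how each value relates to the values that precede it.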
-
36.
Publication No.: US20230409732A1
Publication date: 2023-12-21
Application No.: US18357066
Filing date: 2023-07-21
Applicant: Intel Corporation
CPC classification: G06F21/6227, G06F16/27, G06F21/6254, G06F21/70, G06F9/30036, G06F9/30018, G06F9/30032, G06F9/30101, G06F9/3802
Abstract: An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and to replicate the second data structure when executing the second instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity.
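The replicate-then-mask flow can be modeled briefly. In the sketch below, a 512-bit result vector, one mask bit per element, and a zero-filled default destination are all assumptions chosen for illustration; `replicate_masked` is a hypothetical name.

```python
def replicate_masked(structure, elem_bits, mask, vector_bits=512, dest=None):
    """Replicate a packed structure across a vector, then apply a mask.

    The number of mask bits follows the element size: halving `elem_bits`
    doubles the lanes, which models the finer of the two granularities.
    """
    lanes = vector_bits // elem_bits
    if lanes % len(structure) != 0:
        raise ValueError("structure must tile the vector exactly")
    replicated = list(structure) * (lanes // len(structure))
    if dest is None:
        dest = [0] * lanes                 # assumed zeroing behaviour
    return [
        new if (mask >> i) & 1 else old
        for i, (new, old) in enumerate(zip(replicated, dest))
    ]
```

With 64-bit elements the mask covers 8 lanes; with 32-bit elements the same vector has 16 lanes, so the mask is twice as fine, matching the relationship stated in the abstract.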
-
Publication No.: US20230029176A1
Publication date: 2023-01-26
Application No.: US17868448
Filing date: 2022-07-19
Applicant: Intel Corporation
Inventors: JOYDEEP RAY, ARAVINDH ANANTARAMAN, ABHISHEK R. APPU, ALTUG KOKER, ELMOUSTAPHA OULD-AHMED-VALL, VALENTIN ANDREI, SUBRAMANIAM MAIYURAN, NICOLAS GALOPPO VON BORRIES, VARGHESE GEORGE, MIKE MACPHERSON, BEN ASHBAUGH, MURALI RAMADOSS, VIKRANTH VEMULAPALLI, WILLIAM SADLER, JONATHAN PEARCE, SUNGYE KIM
Abstract: Methods and apparatus relating to scalar core integration in a graphics processor. In an example, an apparatus comprises a processor to receive a set of workload instructions for a graphics workload from a host complex, determine a first subset of operations in the set of operations that is suitable for execution by a scalar processor complex of the graphics processing device and a second subset of operations in the set of operations that is suitable for execution by a vector processor complex of the graphics processing device, assign the first subset of operations to the scalar processor complex for execution to generate a first set of outputs, and assign the second subset of operations to the vector processor complex for execution to generate a second set of outputs. Other embodiments are also disclosed and claimed.
-
38.
Publication No.: US20220206989A1
Publication date: 2022-06-30
Application No.: US17134129
Filing date: 2020-12-24
Applicant: Intel Corporation
Abstract: Systems, methods, and apparatuses relating to one or more instructions for loading a tile of a matrix operations accelerator are described. In one embodiment, a system includes a matrix operations accelerator circuit comprising a two-dimensional grid of processing elements, a plurality of registers that represents a two-dimensional matrix coupled to the two-dimensional grid of processing elements, and a coupling to a cache; and a hardware processor core coupled to the matrix operations accelerator circuit and comprising a vector register, a decoder circuit to decode a single instruction into a decoded instruction, the single instruction including a first field that identifies the two-dimensional matrix, a second field that identifies a location in the cache, and a third field that identifies the vector register, and an opcode that indicates an execution circuit of the hardware processor core is to load elements into the plurality of registers that represents the two-dimensional matrix from the location in the cache by the coupling to the cache, and load one or more elements from the vector register into the plurality of registers that represents the two-dimensional matrix by a coupling of the hardware processor core to the matrix operations accelerator circuit that is separate from the coupling to the cache, and the execution circuit of the hardware processor core to execute the decoded instruction according to the opcode.
-
39.
Publication No.: US20220206800A1
Publication date: 2022-06-30
Application No.: US17134136
Filing date: 2020-12-24
Applicant: Intel Corporation
Abstract: Systems, methods, and apparatuses relating to one or more instructions for row or column aligning of a tile of a matrix operations accelerator are described. In one embodiment, a system includes a matrix operations accelerator circuit comprising a two-dimensional grid of processing elements, a first plurality of registers that represents a first two-dimensional matrix coupled to the two-dimensional grid of processing elements, and a second plurality of registers that represents a second two-dimensional matrix coupled to the two-dimensional grid of processing elements; and a hardware processor core coupled to the matrix operations accelerator circuit and comprising a decoder circuit to decode a single instruction into a decoded instruction, the single instruction including a first field that identifies the first two-dimensional matrix, a second field that identifies the second two-dimensional matrix, and an opcode that indicates an execution circuit of the hardware processor core is to cause a third two-dimensional matrix to be logically formed for input into the two-dimensional grid of processing elements from the first two-dimensional matrix and the second two-dimensional matrix without moving data elements within the first plurality of registers and the second plurality of registers, and the execution circuit of the hardware processor core to execute the decoded instruction according to the opcode.
-
40.
Publication No.: US20210294604A1
Publication date: 2021-09-23
Application No.: US17226986
Filing date: 2021-04-09
Applicant: Intel Corporation
Inventors: VENKATESWARA MADDURI, ELMOUSTAPHA OULD-AHMED-VALL, MARK CHARNEY, ROBERT VALENTINE, JESUS CORBAL, BINWEI YANG
IPC classification: G06F9/30
Abstract: An apparatus and method for performing dual concurrent multiplications of packed data elements. For example, one embodiment of a processor comprises: a decoder to decode a first instruction to generate a decoded instruction; a first source register to store a first plurality of packed doubleword data elements; a second source register to store a second plurality of packed doubleword data elements; and execution circuitry to execute the decoded instruction, the execution circuitry comprising: multiplier circuitry to multiply a first doubleword data element from the first source register with a second doubleword data element from the second source register to generate a first quadword product and to concurrently multiply a third doubleword data element from the first source register with a fourth doubleword data element from the second source register to generate a second quadword product; and a destination register to store the first quadword product and the second quadword product as first and second packed quadword data elements.
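The arithmetic in this abstract, two 32-bit by 32-bit multiplications producing two packed 64-bit products, can be sketched as follows. The abstract does not say which doublewords are selected or whether the multiplication is signed; this sketch assumes unsigned operands taken from doubleword positions 0 and 2 of each source, and `dual_multiply` is a hypothetical name.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def dual_multiply(src1, src2):
    """Multiply two doubleword pairs and return two quadword products.

    `src1` and `src2` are lists of four doublewords. Positions 0 and 2 are
    used (an assumption); hardware would perform both products concurrently,
    whereas this model simply computes them one after the other.
    """
    first = (src1[0] & MASK32) * (src2[0] & MASK32)
    second = (src1[2] & MASK32) * (src2[2] & MASK32)
    # A 32x32-bit unsigned product always fits in 64 bits; the mask only
    # documents the quadword width of each destination element.
    return [first & MASK64, second & MASK64]
```

The widening matters at the extremes: multiplying `0xFFFFFFFF` by itself needs all 64 bits of the destination element, which is why the products are stored as quadwords rather than doublewords.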