Abstract:
A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field (comprising an alpha field and a beta field), and a data element width field, wherein the vector friendly instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of those fields on each occurrence of an instruction in the vector friendly instruction format in instruction streams.
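The field-based encoding described above can be sketched in a few lines. The bit offsets, widths, and field values below are illustrative assumptions, not the format actually claimed; the sketch only shows the idea that each occurrence of an instruction carries exactly one value per field.

```python
# Hypothetical field layout for illustration only: (bit offset, width).
# The real format's bit positions and widths are not specified in the abstract.
FIELDS = {
    "base_op":    (0, 8),   # base operation field
    "modifier":   (8, 2),   # modifier field
    "alpha":      (10, 1),  # alpha field of the augmentation operation field
    "beta":       (11, 3),  # beta field of the augmentation operation field
    "elem_width": (14, 1),  # data element width (e.g. 0 = 32-bit, 1 = 64-bit)
}

def encode(values):
    """Pack exactly one value per field into a single encoding word."""
    word = 0
    for name, (offset, width) in FIELDS.items():
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << offset
    return word

def decode(word):
    """Recover the single value held in each field of an encoding word."""
    return {name: (word >> offset) & ((1 << width) - 1)
            for name, (offset, width) in FIELDS.items()}

vals = {"base_op": 0x2A, "modifier": 1, "alpha": 1, "beta": 5, "elem_width": 0}
assert decode(encode(vals)) == vals
```

Because each field occupies a disjoint bit range, decoding is a pure mask-and-shift per field, which is what lets one format express many base-operation/augmentation combinations.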
Abstract:
The present disclosure provides an apparatus comprising a silicon interposer, a communication fabric, and an accelerator die comprising a plurality of computing elements to simultaneously perform operations on a plurality of matrix data elements. The apparatus further comprises a plurality of dot-product engines to compute a plurality of dot products on the matrix data elements to generate a plurality of result matrix data elements; a buffer or cache to store the plurality of matrix data elements; a memory controller coupled to the communication fabric; and a stacked DRAM that stacks a plurality of DRAM dies vertically on the silicon interposer substrate coupled to the accelerator die.
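Functionally, a grid of dot-product engines computes one result matrix element per engine: each engine takes a row of one operand and a column of the other and reduces them to a scalar. The sketch below models that behavior sequentially; the function names are hypothetical and the engines would operate in parallel in hardware.

```python
def dot_product_engine(row, col):
    """One engine reduces a row/column pair to a single result element."""
    return sum(a * b for a, b in zip(row, col))

def matmul_via_engines(A, B):
    """Result matrix: each element is produced by one dot-product engine
    operating on a row of A and a column of B (conceptually in parallel)."""
    cols_B = list(zip(*B))  # transpose B so columns are addressable
    return [[dot_product_engine(row, col) for col in cols_B] for row in A]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_via_engines(A, B) == [[19, 22], [43, 50]]
```

This is why a buffer or cache for the operand matrix elements sits close to the engines: every row and column is reused across many dot products.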
Abstract:
An apparatus includes a decode unit to decode a permute instruction and a vector conflict instruction. A vector execution unit is coupled with the decode unit and includes a fully-connected interconnect. The fully-connected interconnect has at least four inputs to receive at least four corresponding data elements of at least one source vector. The fully-connected interconnect has at least four outputs. Each of the at least four inputs is coupled with each of the at least four outputs. The execution unit also includes permute instruction execution logic coupled with the at least four outputs and operable to store a first vector result in response to the permute instruction. The execution unit also includes vector conflict instruction execution logic coupled with the at least four outputs and operable to store a second vector result in a destination storage location in response to the vector conflict instruction.
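The two instructions' element-level semantics can be sketched as follows. The permute model assumes an index-vector form (output lane i takes the source element selected by idx[i]), and the conflict model assumes VPCONFLICT-style semantics (each lane reports a bitmask of earlier lanes holding an equal value); both are assumptions for illustration, since the abstract describes the hardware rather than the exact lane behavior.

```python
def permute(src, idx):
    """Permute: output lane i takes the source element selected by idx[i].
    A fully-connected interconnect lets any input reach any output lane."""
    return [src[i] for i in idx]

def vector_conflict(src):
    """For each lane, a bitmask marking earlier lanes that hold an equal
    value (VPCONFLICT-style semantics, assumed here for illustration)."""
    out = []
    for i, v in enumerate(src):
        mask = 0
        for j in range(i):
            if src[j] == v:
                mask |= 1 << j
        out.append(mask)
    return out

assert permute([10, 20, 30, 40], [3, 0, 2, 1]) == [40, 10, 30, 20]
assert vector_conflict([7, 3, 7, 7]) == [0, 0, 0b0001, 0b0101]
```

Both operations need every input lane visible to every output lane, which is why the same fully-connected interconnect can feed both execution logics.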
Abstract:
The present disclosure provides an apparatus comprising an accelerator; a local memory comprising a plurality of stacked dynamic random access memory (DRAM) dies; and a silicon bridge to couple the accelerator to the plurality of stacked DRAM dies, wherein connections between the accelerator and the plurality of stacked DRAM dies run through the silicon bridge. The accelerator comprises a plurality of processing elements to perform processing tasks allocated by an external processor; a cache coherent interface to couple the accelerator to the external processor, the cache coherent interface to ensure that data stored in the local memory and/or an accelerator cache is coherent with data stored in a system memory and caches of the external processor; and logic to map a virtual memory space to heterogeneous forms of physical system memory including the local memory and the system memory, the accelerator and the external processor both using the virtual memory space to access corresponding portions of the local memory and the system memory.
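The shared virtual memory space described above can be modeled as a single page table whose entries name both a physical tier and a physical page. Everything below (tier names, page size, class shape) is an illustrative assumption; the point is that the accelerator and the host translate the same virtual address through the same mapping, regardless of which physical memory backs it.

```python
LOCAL, SYSTEM = "stacked-DRAM", "system-memory"  # hypothetical tier names
PAGE = 4096

class UnifiedAddressSpace:
    """Sketch of one virtual space mapped onto heterogeneous physical
    memory; both the accelerator and the host consult the same table."""
    def __init__(self):
        self.page_table = {}  # virtual page number -> (tier, physical page)

    def map_page(self, vpn, tier, ppn):
        self.page_table[vpn] = (tier, ppn)

    def translate(self, vaddr):
        tier, ppn = self.page_table[vaddr // PAGE]
        return tier, ppn * PAGE + vaddr % PAGE

space = UnifiedAddressSpace()
space.map_page(0, LOCAL, 42)   # hot data placed in accelerator-local DRAM
space.map_page(1, SYSTEM, 7)   # colder data placed in host system memory
assert space.translate(100) == (LOCAL, 42 * PAGE + 100)
assert space.translate(PAGE + 8) == (SYSTEM, 7 * PAGE + 8)
```

The cache coherent interface is what makes this transparent: a load through either tier returns the same value the other agent last wrote.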
Abstract:
The present disclosure provides a processor including a processor core. The processor core includes: a decoder to decode at least one instruction native to the processor core; one or more execution units to execute at least one decoded instruction, the at least one decoded instruction corresponding to an acceleration begin instruction, the acceleration begin instruction to indicate a start of a region of code to be offloaded to an accelerator.
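The acceleration begin instruction acts as a marker in the instruction stream: the core executes natively up to it, and the region after it is handed to an accelerator. The sketch below assumes a matching end marker and uses made-up mnemonics purely for illustration; the abstract only specifies the begin marker.

```python
ACCEL_BEGIN, ACCEL_END = "accel_begin", "accel_end"  # hypothetical mnemonics

def extract_offload_region(stream):
    """Split an instruction stream at the acceleration-begin marker: the
    host core runs the prefix, the marked region goes to the accelerator,
    and execution resumes on the host after the (assumed) end marker."""
    begin = stream.index(ACCEL_BEGIN)
    end = stream.index(ACCEL_END, begin)
    return stream[:begin], stream[begin + 1:end], stream[end + 1:]

host_pre, offload, host_post = extract_offload_region(
    ["mov", "add", ACCEL_BEGIN, "matmul", "relu", ACCEL_END, "ret"])
assert offload == ["matmul", "relu"]
assert host_pre == ["mov", "add"] and host_post == ["ret"]
```

Marking the region in-band lets the same binary run with or without an accelerator present: a core lacking one could simply execute the region itself.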
Abstract:
The present disclosure provides a method and an apparatus comprising a decoder to decode an enqueue command instruction, and execution circuitry, where execution of the enqueue command instruction causes the execution circuitry to: generate a work descriptor based, at least in part, on data from a source operand of the enqueue command instruction, the work descriptor comprising a plurality of fields including an operation field to specify one or more operations to be performed, a flag to indicate whether the work descriptor can be processed in parallel with one or more other work descriptors, and an address field associated with the one or more operations; and store the work descriptor to a work queue.
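The enqueue path above reduces to two steps: build a descriptor from the source operand, then store it to a work queue. The sketch below models both; all field names, operation names, and addresses are illustrative assumptions, not the claimed descriptor layout.

```python
from collections import deque

work_queue = deque()  # stand-in for the hardware work queue

def enqueue_command(operations, parallel_ok, address):
    """Model of the enqueue command: build a work descriptor from source
    operand data and store it to the work queue (field names illustrative)."""
    descriptor = {
        "operations": operations,  # operation field: what work to perform
        "parallel": parallel_ok,   # flag: may run alongside other descriptors
        "address": address,        # address field associated with the operations
    }
    work_queue.append(descriptor)
    return descriptor

enqueue_command(["copy"], parallel_ok=True, address=0x1000)
enqueue_command(["crc32"], parallel_ok=False, address=0x2000)
assert len(work_queue) == 2
assert work_queue[0]["parallel"] is True
```

Carrying the parallelism flag inside the descriptor lets the consumer of the queue decide scheduling per work item rather than per queue.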