Abstract:
Instructions and logic provide SIMD secure hashing round slice functionality. Some embodiments include a processor comprising: a decode stage to decode an instruction for a SIMD secure hashing algorithm round slice, the instruction specifying a source data operand set, a message-plus-constant operand set, a round-slice portion of the secure hashing algorithm round, and a rotator set portion of rotate settings. Processor execution units, responsive to the decoded instruction, perform a secure hashing round-slice set of round iterations upon the source data operand set, applying the message-plus-constant operand set and the rotator set, and store a result of the instruction in a SIMD destination register. One embodiment of the instruction specifies a hash round type as one of four MD5 round types. Other embodiments may specify a hash round type by an immediate operand as one of three SHA-1 round types or as a SHA-2 round type.
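As context for the round-slice operation, here is a minimal scalar sketch of a single MD5 round step of the first round type; the instruction described above would iterate such steps over SIMD operands using the supplied message-plus-constant and rotate values. The function and variable names are illustrative only, not the patented micro-architecture.

```c
#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned r) {
    r &= 31;
    return r ? (x << r) | (x >> (32 - r)) : x;
}

/* One MD5 round step of the first round type:
 *   a = b + ((a + F(b,c,d) + msg_plus_const) <<< rot)
 * where msg_plus_const is the pre-added message word M[k] + constant T[i]
 * and rot is the per-step rotate setting. */
static inline uint32_t md5_step_f(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                                  uint32_t msg_plus_const, unsigned rot) {
    uint32_t f = (b & c) | (~b & d);   /* round-type-1 function F */
    return b + rotl32(a + f + msg_plus_const, rot);
}
```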
Abstract:
An apparatus is described that includes an execution unit within an instruction pipeline. The execution unit has multiple stages of a circuit that includes: a) a first logic circuitry section having multiple mix logic sections, each having: i) a first input to receive a first quad word and a second input to receive a second quad word; ii) an adder having a pair of inputs respectively coupled to the first and second inputs; iii) a rotator having an input coupled to the second input; and iv) an XOR gate having a first input coupled to an output of the adder and a second input coupled to an output of the rotator; and b) permute logic circuitry having inputs coupled to the respective adder and XOR gate outputs of the multiple mix logic sections.
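A minimal C model of one mix section, as described, might look like the following: the adder sums the two quad words, the rotator rotates the second quad word, and the XOR gate combines the adder and rotator outputs before both feed the permute stage. The rotation amount and all names are illustrative assumptions; this is a sketch of the described add-rotate-XOR datapath, not the claimed circuit.

```c
#include <stdint.h>

static inline uint64_t rotl64(uint64_t x, unsigned r) {
    r &= 63;
    return r ? (x << r) | (x >> (64 - r)) : x;
}

/* Hedged sketch of one mix section: a and b are the two quad words, rot is
 * an assumed rotation amount. Both outputs would feed the permute logic. */
static inline void mix_section(uint64_t a, uint64_t b, unsigned rot,
                               uint64_t *sum_out, uint64_t *xor_out) {
    uint64_t sum   = a + b;           /* adder output */
    uint64_t rot_b = rotl64(b, rot);  /* rotator output */
    *sum_out = sum;
    *xor_out = sum ^ rot_b;           /* XOR of adder and rotator outputs */
}
```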
Abstract:
Methods and apparatus relating to an instruction and/or micro-architecture support for decompression on core are described. In an embodiment, decode circuitry decodes a decompression instruction into a first micro operation and a second micro operation. The first micro operation causes one or more load operations to fetch data into one or more cachelines of a cache of a processor core. Decompression Engine (DE) circuitry decompresses the fetched data from the one or more cachelines of the cache of the processor core in response to the second micro operation. Other embodiments are also disclosed and claimed.
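A software model of the two-micro-operation split might look like the sketch below: the first micro-operation stages compressed bytes into cacheline-sized buffers (standing in for the cache fills), and the second hands those buffers to a routine standing in for the DE circuitry. All names are hypothetical, and memcpy is used as a trivial placeholder so the sketch stays self-contained.

```c
#include <stddef.h>
#include <string.h>

#define CACHELINE 64

/* Hypothetical model of the first micro-op: fetch data into cacheline-sized
 * buffers (memcpy stands in for the load operations / cache fills). */
static size_t uop_load_to_cachelines(unsigned char *cachelines, size_t capacity,
                                     const unsigned char *src, size_t len) {
    size_t n = len < capacity ? len : capacity;
    memcpy(cachelines, src, n);
    return n;
}

/* Hypothetical model of the second micro-op: ask the DE to decompress from
 * those cachelines. A real DE would run a compression algorithm such as
 * DEFLATE; memcpy is only a stand-in. */
static size_t uop_de_decompress(unsigned char *dst, size_t dst_cap,
                                const unsigned char *cachelines, size_t len) {
    size_t n = len < dst_cap ? len : dst_cap;
    memcpy(dst, cachelines, n);
    return n;
}
```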
Abstract:
An embodiment of an integrated circuit may comprise, coupled to a core, a hardware decompression accelerator, a compressed cache, a processor communicatively coupled to the hardware decompression accelerator and the compressed cache, and memory communicatively coupled to the processor, wherein the memory stores microcode instructions which, when executed by the processor, cause the processor to store a first address to a decompression work descriptor, retrieve, in response to an indication of a page fault, from the decompression work descriptor at the first address a second address at which a compressed page is stored in the compressed cache, and send instructions to the hardware decompression accelerator to decompress the compressed page at the second address. Other embodiments are disclosed and claimed.
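One way to picture the microcode flow is the sketch below: a work descriptor records where the compressed page lives in the compressed cache, and the page-fault path reads that descriptor and submits a decompression job to the accelerator. The structure layout, function names, and submission interface are assumptions for illustration only.

```c
#include <stdint.h>

/* Hypothetical decompression work descriptor: records where the compressed
 * page is stored in the compressed cache (the "second address"). */
struct decomp_work_desc {
    uint64_t compressed_page_addr;
};

/* Stand-in for submitting a job to the hardware decompression accelerator;
 * a real implementation would enqueue work on the accelerator. */
static void accel_decompress_page(uint64_t compressed_page_addr) {
    (void)compressed_page_addr;
}

/* Microcode-like flow: on a page fault, read the descriptor stored at the
 * first address, recover the compressed page address, and instruct the
 * accelerator to decompress the page found there. */
static void on_page_fault(const struct decomp_work_desc *desc_at_first_addr) {
    uint64_t second_addr = desc_at_first_addr->compressed_page_addr;
    accel_decompress_page(second_addr);
}
```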
Abstract:
Methods and apparatus relating to an Application Programming Interface (API) for fine grained low latency decompression within a processor core are described. In an embodiment, a decompression Application Programming Interface (API) receives an input handle to a data object. The data object includes compressed data and metadata. Decompression Engine (DE) circuitry decompresses the compressed data to generate uncompressed data. The DE circuitry decompresses the compressed data in response to invocation of a decompression instruction by the decompression API. The metadata comprises a first operand to indicate a location of the compressed data, a second operand to indicate a size of the compressed data, a third operand to indicate a location to which the data decompressed by the DE circuitry is to be stored, and a fourth operand to indicate a size of the decompressed data. Other embodiments are also disclosed and claimed.
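As a concrete reading of the four metadata operands, a hedged C sketch of such an API follows; the handle layout and names are illustrative assumptions, and the actual decompression instruction is modeled by a placeholder routine so the sketch stays self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical data-object handle: compressed data plus the four metadata
 * operands named in the abstract. */
struct data_object {
    const void *src;       /* operand 1: location of the compressed data    */
    size_t      src_size;  /* operand 2: size of the compressed data        */
    void       *dst;       /* operand 3: where decompressed data is stored  */
    size_t      dst_size;  /* operand 4: size of the decompressed data      */
};

/* Placeholder for issuing the decompression instruction to the DE circuitry;
 * memcpy stands in so the sketch is runnable. */
static size_t de_decompress(void *dst, size_t dst_size,
                            const void *src, size_t src_size) {
    size_t n = src_size < dst_size ? src_size : dst_size;
    memcpy(dst, src, n);
    return n;
}

/* Hypothetical API entry point: takes the handle and invokes the
 * decompression instruction with the operands the handle carries. */
size_t decompress_api(const struct data_object *obj) {
    return de_decompress(obj->dst, obj->dst_size, obj->src, obj->src_size);
}
```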
Abstract:
An embodiment of an apparatus comprises decode circuitry to decode a single instruction, the single instruction to include respective fields for one or more source operands, one or more destination operands, and an opcode, the opcode to indicate execution circuitry is to perform a fused addition and subtraction operation, and execution circuitry to execute the decoded instruction according to the opcode to retrieve data from one or more locations indicated by the one or more source operands and to perform the fused addition and subtraction operation indicated by the opcode on three or more arguments indicated by the retrieved data to produce one or more results. Other embodiments are disclosed and claimed.
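A scalar model of the fused operation on three arguments might compute, for example, r = a + b - c per element in one step; the element width and the particular sign pattern are assumptions, since the abstract only specifies a fused addition and subtraction over three or more arguments.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of the fused addition-and-subtraction operation:
 * each result element is a[i] + b[i] - c[i], treated as one fused step. */
static void fused_addsub(int64_t *r, const int64_t *a, const int64_t *b,
                         const int64_t *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        r[i] = a[i] + b[i] - c[i];
}
```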
Abstract:
Systems, methods, and apparatuses relating to performing hashing operations on packed data elements are described. In one embodiment, a processor includes a decode circuit to decode a single instruction into a decoded single instruction, the single instruction including at least one first field that identifies eight 32-bit state elements A, B, C, D, E, F, G, and H for a round according to a SM3 hashing standard and at least one second field that identifies an input message; and an execution circuit to execute the decoded single instruction to: rotate state element C left by 9 bits to form a rotated state element C, rotate state element D left by 9 bits to form a rotated state element D, rotate state element G left by 19 bits to form a rotated state element G, rotate state element H left by 19 bits to form a rotated state element H, perform two rounds according to the SM3 hashing standard on the input message and state element A, state element B, rotated state element C, rotated state element D, state element E, state element F, rotated state element G, and rotated state element H to generate an updated state element A, an updated state element B, an updated state element E, and an updated state element F, and store the updated state element A, the updated state element B, the updated state element E, and the updated state element F into a location specified by the single instruction.
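For reference, here is a scalar sketch of the standard SM3 compression round (following the published SM3 specification); the instruction described above pre-rotates C, D, G, and H, performs two such rounds, and returns only the updated A, B, E, and F, but those micro-architectural details are not modeled here.

```c
#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, unsigned r) {
    r &= 31;
    return r ? (x << r) | (x >> (32 - r)) : x;
}

/* Boolean and permutation functions from the SM3 standard (round index j). */
static inline uint32_t FF(uint32_t x, uint32_t y, uint32_t z, int j) {
    return j < 16 ? x ^ y ^ z : (x & y) | (x & z) | (y & z);
}
static inline uint32_t GG(uint32_t x, uint32_t y, uint32_t z, int j) {
    return j < 16 ? x ^ y ^ z : (x & y) | (~x & z);
}
static inline uint32_t P0(uint32_t x) {
    return x ^ rotl32(x, 9) ^ rotl32(x, 17);
}

/* One SM3 compression round on state s = {A,B,C,D,E,F,G,H} with expanded
 * message words w (W_j) and wp (W'_j = W_j ^ W_{j+4}). */
static void sm3_round(uint32_t s[8], uint32_t w, uint32_t wp, int j) {
    uint32_t A = s[0], B = s[1], C = s[2], D = s[3];
    uint32_t E = s[4], F = s[5], G = s[6], H = s[7];
    uint32_t Tj  = (j < 16) ? 0x79CC4519u : 0x7A879D8Au;
    uint32_t SS1 = rotl32(rotl32(A, 12) + E + rotl32(Tj, (unsigned)j), 7);
    uint32_t SS2 = SS1 ^ rotl32(A, 12);
    uint32_t TT1 = FF(A, B, C, j) + D + SS2 + wp;
    uint32_t TT2 = GG(E, F, G, j) + H + SS1 + w;
    s[3] = C;               /* D' = C        */
    s[2] = rotl32(B, 9);    /* C' = B <<< 9  */
    s[1] = A;               /* B' = A        */
    s[0] = TT1;             /* A' = TT1      */
    s[7] = G;               /* H' = G        */
    s[6] = rotl32(F, 19);   /* G' = F <<< 19 */
    s[5] = E;               /* F' = E        */
    s[4] = P0(TT2);         /* E' = P0(TT2)  */
}
```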
Abstract:
A processor includes an instruction set architecture having instructions to perform a data-parallel multiply on a set of 52-bit integers and further instructions that additionally perform an add or subtract on intermediate products of the data-parallel multiply. A 52-bit result of the operations is then zero-extended to 64 bits.
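A scalar sketch of one lane of such an operation follows, reading the abstract as: multiply two 52-bit operands, add part of the 104-bit intermediate product to an accumulator, and zero-extend the 52-bit result into a 64-bit lane. Which half of the product is used, and the use of a 128-bit compiler extension for the intermediate, are assumptions.

```c
#include <stdint.h>

#define MASK52 ((1ULL << 52) - 1)

/* Hypothetical scalar model of one lane: multiply two 52-bit integers, add
 * the low 52 bits of the 104-bit intermediate product to the accumulator,
 * and return the 52-bit result zero-extended in a 64-bit lane.
 * (unsigned __int128 is a GCC/Clang extension.) */
static inline uint64_t madd52(uint64_t acc, uint64_t a, uint64_t b) {
    unsigned __int128 prod = (unsigned __int128)(a & MASK52) * (b & MASK52);
    uint64_t lo52 = (uint64_t)prod & MASK52;
    return (acc + lo52) & MASK52;
}
```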
Abstract:
Systems, methods, and apparatuses for accelerating streaming data-transformation operations are described. In one example, a system on a chip (SoC) includes a hardware processor core comprising a decoder circuit to decode an instruction comprising an opcode into a decoded instruction, the opcode to indicate an execution circuit is to generate a single descriptor and cause the single descriptor to be sent to an accelerator circuit coupled to the hardware processor core, and the execution circuit to execute the decoded instruction according to the opcode; and the accelerator circuit comprising a work dispatcher circuit and one or more work execution circuits to, in response to the single descriptor sent from the hardware processor core: when a field of the single descriptor is a first value, cause a single job to be sent by the work dispatcher circuit to a single work execution circuit of the one or more work execution circuits to perform an operation indicated in the single descriptor to generate an output, and when the field of the single descriptor is a second different value, cause a plurality of jobs to be sent by the work dispatcher circuit to the one or more work execution circuits to perform the operation indicated in the single descriptor to generate the output as a single stream.
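One way to read the descriptor field is sketched below: a single descriptor either maps to one job on one work execution circuit or is fanned out as several jobs whose outputs form one stream. The structure, field values, and dispatch routine are illustrative assumptions, not the claimed hardware.

```c
#include <stdint.h>

/* Hypothetical single descriptor carrying the mode field from the abstract. */
struct descriptor {
    uint32_t mode;         /* FIELD_SINGLE_JOB or FIELD_MULTI_JOB        */
    uint32_t num_engines;  /* work execution circuits available          */
    /* ... operation, source, and destination fields would follow ...    */
};

enum { FIELD_SINGLE_JOB = 1, FIELD_MULTI_JOB = 2 };

/* Stand-in for handing one job to one work execution circuit; a real
 * dispatcher would enqueue hardware work. */
static void run_job_on_engine(unsigned engine, const struct descriptor *d) {
    (void)engine; (void)d;
}

/* Hypothetical work-dispatcher behavior for a single descriptor. */
static void dispatch(const struct descriptor *d) {
    if (d->mode == FIELD_SINGLE_JOB) {
        run_job_on_engine(0, d);                 /* one job, one engine */
    } else {
        for (unsigned e = 0; e < d->num_engines; e++)
            run_job_on_engine(e, d);             /* many jobs, one output stream */
    }
}
```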
Abstract:
Methods and apparatus relating to verifying a compressed stream fused with copy or transform operation(s) are described. In an embodiment, compression logic circuitry compresses input data and stores the compressed data in a temporary buffer. The compression logic circuitry determines a first checksum value corresponding to the compressed data stored in the temporary buffer. Decompression logic circuitry performs a decompress-verify operation and a copy operation. The decompress-verify operation decompresses the compressed data stored in the temporary buffer to determine a second checksum value corresponding to the decompressed data from the temporary buffer. The copy operation transfers the compressed data from the temporary buffer to a destination buffer in response to a match between the first checksum value and the second checksum value. Other embodiments are also disclosed and claimed.
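A hedged software sketch of the decompress-verify-and-copy flow follows. The trivial "compressor", the checksum function, and the choice to checksum the source data during compression and the decompressed output during verification are illustrative stand-ins for one common arrangement.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simple stand-in checksum (a real design might use CRC32 or similar). */
static uint32_t checksum(const unsigned char *p, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/* memcpy stands in for the compression and decompression engines so the
 * sketch stays self-contained; lengths pass through unchanged. */
static size_t compress_stub(unsigned char *dst, const unsigned char *src, size_t n)
{ memcpy(dst, src, n); return n; }
static size_t decompress_stub(unsigned char *dst, const unsigned char *src, size_t n)
{ memcpy(dst, src, n); return n; }

/* Compress into a temporary buffer, decompress-verify, and copy to the
 * destination only when the two checksum values match. Returns 0 on success. */
static int compress_verify_copy(unsigned char *dst, unsigned char *tmp,
                                unsigned char *scratch,
                                const unsigned char *src, size_t n) {
    size_t clen = compress_stub(tmp, src, n);          /* compressed data in tmp  */
    uint32_t first = checksum(src, n);                 /* taken during compression */
    size_t dlen = decompress_stub(scratch, tmp, clen); /* decompress-verify       */
    uint32_t second = checksum(scratch, dlen);
    if (first != second)
        return -1;                                     /* verification failed     */
    memcpy(dst, tmp, clen);                            /* copy compressed stream  */
    return 0;
}
```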