Abstract:
A processor comprising: execution logic to execute a first thread including an accelerator invocation instruction to invoke an accelerator command; an accelerator to execute an accelerator thread in response to the accelerator command, the accelerator to store state data associated with the accelerator thread in an application memory area in memory, wherein prior to executing the accelerator thread, the accelerator is to lock entries in a translation lookaside buffer (TLB) associated with the accelerator thread to prevent an exception which might otherwise result.
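Below is a minimal C sketch of the invocation flow this abstract describes, assuming hypothetical names for the TLB-locking hooks, the accelerator entry point, and the state layout; the real mechanism is hardware, so this only illustrates the ordering (pin translations, run, save state to application memory).

```c
/* Minimal software model of the described flow; all names here are
 * hypothetical stand-ins for hardware behaviour, not a real API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct accel_cmd   { void *src, *dst; size_t len; };
struct accel_state { unsigned progress; };           /* saved thread state     */

static struct accel_state app_state_area;            /* application memory area */

/* Stubs standing in for TLB locking and the accelerator itself. */
static bool tlb_lock_entries(const void *p, size_t n)   { (void)p; (void)n; return true; }
static void tlb_unlock_entries(const void *p, size_t n) { (void)p; (void)n; }
static void accel_run(const struct accel_cmd *c, struct accel_state *s)
{ (void)c; s->progress = 100; }

static int invoke_accelerator(const struct accel_cmd *cmd)
{
    /* Lock translations for the command's buffers *before* the accelerator
     * thread runs, so it cannot take the fault the abstract guards against. */
    if (!tlb_lock_entries(cmd->src, cmd->len) ||
        !tlb_lock_entries(cmd->dst, cmd->len))
        return -1;

    accel_run(cmd, &app_state_area);    /* state saved to application memory */

    tlb_unlock_entries(cmd->src, cmd->len);
    tlb_unlock_entries(cmd->dst, cmd->len);
    return 0;
}

int main(void)
{
    char in[64], out[64];
    struct accel_cmd cmd = { in, out, sizeof in };
    printf("rc=%d progress=%u\n", invoke_accelerator(&cmd), app_state_area.progress);
    return 0;
}
```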
Abstract:
An apparatus and method are described for retrieving elements from a linked structure. For example, one embodiment of an apparatus comprises: a decode unit to decode a first instruction, the first instruction to utilize a current address value, an end address value, and an offset; and an execution unit to execute the first instruction to cause the execution unit to compare the current address value with the end address value, the execution unit to perform no additional operation with respect to the first instruction if the current address value is equal to the end address value; and if the current address value is not equal to the end address value, then the execution unit to add the offset value to the current address value to identify a next address pointer within an element structure, the execution unit to further set the current address value equal to the next address pointer.
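The traversal step this abstract describes can be illustrated with a small C model; the function name, the node layout, and the use of a NULL end address are assumptions made for the example.

```c
/* Software rendering of the described instruction's semantics; names are
 * illustrative, not the patent's mnemonics. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One "element structure": payload followed, at a fixed offset, by the
 * pointer to the next element. */
struct node { int value; struct node *next; };

/* Emulate the instruction: if current == end, leave current untouched;
 * otherwise read the next-address pointer found at (current + offset)
 * and make it the new current address. */
static uintptr_t get_next_element(uintptr_t current, uintptr_t end, size_t offset)
{
    if (current == end)
        return current;                          /* no additional operation   */
    uintptr_t next_ptr_addr = current + offset;  /* locate the next pointer   */
    return *(uintptr_t *)next_ptr_addr;          /* current := next pointer   */
}

int main(void)
{
    struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    uintptr_t cur = (uintptr_t)&a, end = 0;      /* NULL terminates the walk  */
    while (cur != end) {
        printf("%d\n", ((struct node *)cur)->value);
        cur = get_next_element(cur, end, offsetof(struct node, next));
    }
    return 0;
}
```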
Abstract:
A processor having one or more processing cores is described. Each of the one or more processing cores has front end logic circuitry and a plurality of processing units. The front end logic circuitry is to fetch respective instructions of threads and decode the instructions into respective microcode and input operand and resultant addresses of the instructions. Each of the plurality of processing units is to be assigned at least one of the threads, is coupled to said front end logic circuitry, and has a respective buffer to receive and store microcode of its assigned at least one of the threads. Each of the plurality of processing units also comprises: i) at least one set of functional units corresponding to a complete instruction set offered by the processor, the at least one set of functional units to execute its respective processing unit's received microcode; ii) registers coupled to the at least one set of functional units to store operands and resultants of the received microcode; iii) data fetch circuitry to fetch input operands for the at least one set of functional units' execution of the received microcode.
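A rough structural sketch of the organization described above is given below; all field names, widths, and counts are invented for illustration, and the functional units and data fetch circuitry are represented only by comments since they are behaviour rather than state.

```c
/* Rough structural sketch of the per-core organization the abstract describes. */
#include <stdint.h>
#include <stdio.h>

#define NUM_PROCESSING_UNITS 4
#define UOP_BUF_ENTRIES      64

struct micro_op {              /* produced by the front end for each instruction */
    uint32_t opcode;
    uint16_t src_reg[2];       /* input operand addresses */
    uint16_t dst_reg;          /* resultant address       */
};

struct processing_unit {
    int assigned_thread;                          /* at least one thread per unit  */
    struct micro_op uop_buffer[UOP_BUF_ENTRIES];  /* microcode from the front end  */
    uint64_t registers[32];                       /* operands and resultants       */
    /* Functional units (full instruction set) and data fetch circuitry would
     * operate on uop_buffer and registers; they carry no state shown here. */
};

struct core {
    /* Front end: fetches thread instructions, decodes them into micro-ops,
     * and steers each micro-op to its thread's processing unit. */
    struct processing_unit units[NUM_PROCESSING_UNITS];
};

int main(void)
{
    printf("per-core processing units: %d, uop buffer entries each: %d\n",
           NUM_PROCESSING_UNITS, UOP_BUF_ENTRIES);
    return 0;
}
```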
Abstract:
A processor includes circuitry to decode at least one instruction and an execution unit. The decoded instruction may compute a floating point result. The execution unit includes circuitry to execute the instruction to determine the floating point result, compute the amount of precision lost in a mantissa of the floating point result, compare the amount of precision lost to a numeric accumulation error precision threshold, determine whether a numeric accumulation error occurred based on the comparison, and write a value to a flag. The amount of precision lost corresponds to a plurality of bits lost in the mantissa of the floating point result. The value to be written to the flag may be based on the determination that the numeric accumulation error occurred. The flag may be for notification that the numeric accumulation error occurred.
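A hedged software model of the check is sketched below for the case of a floating point add; the helper names and the threshold value are assumptions, and the computed bit count is only an upper bound on the smaller addend's mantissa bits shifted out of the result.

```c
/* Rough software model of the described check, for the specific case of a
 * floating point add; threshold and bit counting are illustrative only. */
#include <float.h>
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Upper bound on how many mantissa bits of the smaller addend fall below the
 * result's least significant mantissa bit, i.e. are shifted out and lost. */
static int mantissa_bits_lost(double a, double b)
{
    double small = fabs(a) < fabs(b) ? a : b;
    int exp_sum, exp_small;
    frexp(a + b, &exp_sum);
    frexp(small, &exp_small);
    int lost = exp_sum - exp_small;
    if (lost < 0) lost = 0;
    if (lost > DBL_MANT_DIG) lost = DBL_MANT_DIG;
    return lost;
}

/* Compare the loss against a threshold and return the value for the flag. */
static bool accumulation_error_flag(double a, double b, int threshold)
{
    return mantissa_bits_lost(a, b) > threshold;
}

int main(void)
{
    printf("%d\n", accumulation_error_flag(1e16, 3.0, 8));  /* 1: most bits lost  */
    printf("%d\n", accumulation_error_flag(1.0, 1.5, 8));   /* 0: little/no loss  */
    return 0;
}
```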
Abstract:
A processor of an aspect includes a plurality of processor elements, and a first processor element. The first processor element may perform a user-level fork instruction of a software thread. The first processor element may include a decoder to decode the user-level fork instruction. The user-level fork instruction is to indicate at least one instruction address. The first processor element may also include a user-level thread fork module. The user-level thread fork module, in response to the user-level fork instruction being decoded, may configure each of the plurality of processor elements to perform instructions in parallel. Other processors, methods, systems, and instructions are disclosed.
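As a user-space analogy, the sketch below uses POSIX threads to stand in for the processor elements; the function names and element count are invented, and a real user-level fork instruction would dispatch hardware elements rather than OS threads.

```c
/* Rough user-space analogy of the described behaviour using POSIX threads. */
#include <pthread.h>
#include <stdio.h>

#define NUM_PROCESSOR_ELEMENTS 4

/* The "at least one instruction address" indicated by the instruction:
 * here, the entry point each element starts executing. */
static void *parallel_entry(void *arg)
{
    long element_id = (long)arg;
    printf("processor element %ld running in parallel\n", element_id);
    return NULL;
}

/* Stand-in for the user-level thread fork module: configure every element
 * to begin executing at the indicated address. */
static void user_level_fork(void *(*entry)(void *))
{
    pthread_t elems[NUM_PROCESSOR_ELEMENTS];
    for (long i = 0; i < NUM_PROCESSOR_ELEMENTS; i++)
        pthread_create(&elems[i], NULL, entry, (void *)i);
    for (long i = 0; i < NUM_PROCESSOR_ELEMENTS; i++)
        pthread_join(elems[i], NULL);
}

int main(void)
{
    user_level_fork(parallel_entry);   /* the software thread issues the "fork" */
    return 0;
}
```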
Abstract:
An apparatus and method are described for providing low-latency invocation of accelerators. For example, a processor according to one embodiment comprises: a command register for storing command data identifying a command to be executed; a result register to store a result of the command or data indicating a reason why the command could not be executed; execution logic to execute a plurality of instructions including an accelerator invocation instruction to invoke one or more accelerator commands; and one or more accelerators to read the command data from the command register and responsively attempt to execute the command identified by the command data.
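The command/result register handshake can be pictured with the toy C model below; the register layout, opcode values, and status codes are all invented for illustration.

```c
/* Sketch of the command/result register handshake described above, modelled
 * with plain memory rather than real hardware registers. */
#include <stdint.h>
#include <stdio.h>

enum result_code { RESULT_OK = 0, RESULT_BAD_OPCODE = 1, RESULT_BUSY = 2 };

static uint64_t command_register;   /* command data written by the CPU */
static uint64_t result_register;    /* result, or reason for failure   */

/* Accelerator side: read the command register and try to execute it. */
static void accelerator_poll(void)
{
    uint64_t opcode = command_register & 0xFF;
    if (opcode == 0x01)
        result_register = RESULT_OK;           /* command executed            */
    else
        result_register = RESULT_BAD_OPCODE;   /* reason it could not execute */
}

/* CPU side: issue the accelerator invocation, then check the result. */
int main(void)
{
    command_register = 0x01;        /* "invoke" a supported command  */
    accelerator_poll();
    printf("result = %llu\n", (unsigned long long)result_register);

    command_register = 0x7F;        /* unsupported command           */
    accelerator_poll();
    printf("result = %llu\n", (unsigned long long)result_register);
    return 0;
}
```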
Abstract:
Instructions and logic provide atomic range operations in a multiprocessing system. In one embodiment, an atomic range modification instruction specifies an address for a set of range indices. The instruction locks access to the set of range indices and loads the range indices to check the range size. The range size is compared with a size sufficient to perform the range modification. If the range size is sufficient, the range modification is performed and one or more modified range indices of the set of range indices are stored back to memory. Otherwise, when the range size is not sufficient to perform the range modification, an error signal is set. Access to the set of range indices is unlocked responsive to completion of the atomic range modification instruction. Embodiments may include atomic increment-next instructions, add-next instructions, decrement-end instructions, and/or subtract-end instructions.
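A software rendering of an atomic increment-next operation is sketched below, with a mutex standing in for the instruction's locking of the range indices; the structure and function names are assumptions.

```c
/* Software model of an atomic "increment next" range operation. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct range_indices { long next; long end; };   /* the set of range indices */

static pthread_mutex_t range_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns true and advances 'next' by 'count' if the remaining range is big
 * enough; otherwise leaves the indices untouched and signals an error. */
static bool atomic_increment_next(struct range_indices *r, long count, long *claimed)
{
    bool ok;
    pthread_mutex_lock(&range_lock);          /* lock access to the indices     */
    if (r->end - r->next >= count) {          /* is the range size sufficient?  */
        *claimed = r->next;
        r->next += count;                     /* modified index stored back     */
        ok = true;
    } else {
        ok = false;                           /* error signal                   */
    }
    pthread_mutex_unlock(&range_lock);        /* unlock on completion           */
    return ok;
}

int main(void)
{
    struct range_indices r = { 0, 10 };
    long start;
    printf("%d\n", atomic_increment_next(&r, 4, &start));   /* 1: claims 0..3   */
    printf("%d\n", atomic_increment_next(&r, 8, &start));   /* 0: only 6 remain */
    return 0;
}
```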
Abstract:
A processor is described having logic circuitry of a general purpose CPU core to save multiple copies of the context of a thread of the general purpose CPU core, in order to prepare multiple micro-threads of a multi-threaded accelerator for execution and thereby accelerate operations for the thread through parallel execution of the micro-threads.
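The context replication step can be illustrated with the short C sketch below; the context layout and micro-thread count are invented for the example.

```c
/* Minimal sketch of replicating a CPU thread's context for N accelerator
 * micro-threads; layout and count are purely illustrative. */
#include <stdio.h>
#include <string.h>

#define NUM_MICRO_THREADS 4

struct thread_context { unsigned long regs[16]; unsigned long pc; };

int main(void)
{
    struct thread_context cpu_thread = { { 1, 2, 3 }, 0x1000 };
    struct thread_context micro[NUM_MICRO_THREADS];

    /* Save one copy of the CPU thread's context per micro-thread so each
     * micro-thread starts from the same state and can run in parallel. */
    for (int i = 0; i < NUM_MICRO_THREADS; i++)
        memcpy(&micro[i], &cpu_thread, sizeof cpu_thread);

    printf("prepared %d micro-thread contexts, pc=0x%lx\n",
           NUM_MICRO_THREADS, micro[0].pc);
    return 0;
}
```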
Abstract:
A processor is described comprising: instruction failure logic to perform a plurality of operations in response to a detected instruction execution failure, the instruction failure logic to be used for instructions which have complex failure modes and which are expected to have a failure frequency above a threshold, wherein the operations include: detecting an instruction execution failure and determining a reason for the failure; storing failure data in a destination register to indicate the failure and to specify details associated with the failure; and allowing application program code to read the failure data and responsively take one or more actions, wherein the instruction failure logic performs its operations without invocation of an exception handler or switching to a low-level domain on a system which employs hierarchical protection domains.
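The reporting style described above, returning failure data that application code inspects directly instead of trapping to an exception handler, is illustrated by the hedged sketch below; the flag bit and reason codes are invented.

```c
/* Illustration of the failure-reporting style the abstract describes,
 * modelled as a return value the application inspects directly. */
#include <stdint.h>
#include <stdio.h>

/* Bit 63 flags failure; low bits carry the reason, mimicking failure data
 * written to the instruction's destination register. */
#define FAIL_FLAG       (1ULL << 63)
#define FAIL_CONFLICT   1ULL
#define FAIL_BAD_INPUT  2ULL

static uint64_t complex_instruction(uint64_t input)
{
    if (input == 0)
        return FAIL_FLAG | FAIL_BAD_INPUT;   /* failure + reason, no trap */
    return input * 2;                        /* normal result             */
}

int main(void)
{
    uint64_t r = complex_instruction(0);
    if (r & FAIL_FLAG)                       /* application reads failure data */
        printf("instruction failed, reason=%llu; taking corrective action\n",
               (unsigned long long)(r & ~FAIL_FLAG));
    else
        printf("result=%llu\n", (unsigned long long)r);
    return 0;
}
```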
Abstract:
An apparatus and method are described for executing both latency-optimized execution logic and throughput-optimized execution logic on a processing device. For example, a processor according to one embodiment comprises: latency-optimized execution logic to execute a first type of program code; throughput-optimized execution logic to execute a second type of program code, wherein the first type of program code and the second type of program code are designed for the same instruction set architecture; and logic to identify the first type of program code and the second type of program code within a process and to distribute the first type of program code for execution on the latency-optimized execution logic and the second type of program code for execution on the throughput-optimized execution logic.
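A toy dispatcher illustrating the described split is sketched below; the tagging of code regions and the kernel names are assumptions, and both region types are ordinary functions to reflect the shared instruction set architecture.

```c
/* Toy dispatcher: both kinds of code share one ISA (ordinary C functions
 * here), and a small piece of logic steers each region to the matching
 * execution resource. */
#include <stdio.h>

enum code_type { LATENCY_OPTIMIZED, THROUGHPUT_OPTIMIZED };

struct code_region {
    enum code_type type;       /* identified per region within the process */
    void (*entry)(void);
};

static void branchy_scalar_kernel(void)    { puts("latency-optimized logic runs this"); }
static void wide_data_parallel_kernel(void){ puts("throughput-optimized logic runs this"); }

/* Stand-in for the distribution logic. */
static void dispatch(const struct code_region *r)
{
    if (r->type == LATENCY_OPTIMIZED)
        r->entry();            /* would be sent to the latency-optimized units    */
    else
        r->entry();            /* would be sent to the throughput-optimized units */
}

int main(void)
{
    struct code_region regions[] = {
        { LATENCY_OPTIMIZED,    branchy_scalar_kernel },
        { THROUGHPUT_OPTIMIZED, wide_data_parallel_kernel },
    };
    for (int i = 0; i < 2; i++)
        dispatch(&regions[i]);
    return 0;
}
```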