Abstract:
Systems, methods, and apparatuses relating to swizzle operations and disable operations in a configurable spatial accelerator (CSA) are described. Certain embodiments herein provide for an encoding system for a specific set of swizzle primitives across a plurality of packed data elements in a CSA. In one embodiment, a CSA includes a plurality of processing elements, a circuit switched interconnect network between the plurality of processing elements, and a configuration register within each processing element to store a configuration value having a first portion that, when set to a first value that indicates a first mode, causes the processing element to pass an input value to operation circuitry of the processing element without modifying the input value, and, when set to a second value that indicates a second mode, causes the processing element to perform a swizzle operation on the input value to form a swizzled input value before sending the swizzled input value to the operation circuitry of the processing element, and a second portion that causes the processing element to perform an operation indicated by the second portion the configuration value on the input value in the first mode and the swizzled input value in the second mode with the operation circuitry.
Abstract:
Methods and apparatuses relate to emulating architectural performance monitoring in a binary translation system. In one embodiment, a processor includes an architectural performance counter to maintain an architectural value associated with instruction execution, a register to store the architectural value of the architectural performance counter, binary translation logic to embed an architectural value from the architectural performance counter into a stream of translated instructions having a transactional code region and to store the architectural value into the register, and an execution unit to execute the transactional code region of the stream of translated instructions. The binary translation logic is configured to add the architectural value from the register to the architectural performance counter upon completion of the transactional code region of the stream of translated instructions. In one embodiment, a binary translation system overcomes software incompatibilities by using microarchitectural support to transparently and accurately emulate architectural performance counter behavior.
Abstract:
Technologies for optimized binary translation include a computing device that determines a cost-benefit metric associated with each translated code block of a translation cache. The cost-benefit metric is indicative of translation cost and performance benefit associated with the translated code block. The translation cost may be determined by measuring translation time of the translated code block. The cost-benefit metric may be calculated using a weighted cost-benefit function based on an expected workload of the computing device. In response to determining to free space in the translation cache, the computing device determines whether to discard each translated code block as a function of the cost-benefit metric. In response to determining to free space in the translation cache, the computing device may increment an iteration count and skip each translated code block if the iteration count modulo the corresponding cost-benefit metric is non-zero. Other embodiments are described and claimed.
Abstract:
State recovery methods and apparatus for computing platforms are disclosed. An example method includes inserting a first instruction into optimized code to cause a first portion of a register in a first state to be saved to memory before execution of a region of the optimized code; and maintaining a value indicative of a manner in which a second portion of the register in the first state is to be restored in connection with a state recovery from the optimized code.
Abstract:
Technologies for shadow stack management include a computing device that, when executing a translated call routine in a translated binary, pushes a native return address on to a native stack of the computing device, adds a constant offset to a stack pointer of the computing device, executes a native call instruction to a translated call target, and, after executing the native call instruction, subtracts the constant offset from the stack pointer. Executing the native call instruction pushes a translated return address onto a shadow stack of the computing device. The computing device may map two or more virtual memory pages of the shadow stack onto a single physical memory page. The computing device may execute a translated return routine that pops the native return address from the native stack, adds the constant offset to the stack pointer, and executes a native return instruction. Other embodiments are described and claimed.
Abstract:
Systems, apparatuses, and methods for improving transactional memory (TM) throughput using a TM region indicator (or color) are described. Through the use of TM region indicators younger TM regions can have their instructions retired while waiting for older TM regions to commit. A copy-on-write (COW) buffer may be used to maintain a mapping from checkpointed architectural registers to physical registers, wherein the COW buffer maintains a plurality of register checkpoints for a plurality of TM regions by marking separations between TM regions using pointers, a first pointer to identify a position in the COW buffer of the last committed instruction, a retirement pointer to identify a boundary between a youngest TM region and a currently retiring position.
Abstract:
Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
Abstract:
Embodiments of systems, methods, and apparatuses for heterogeneous computing are described. In some embodiments, a hardware heterogeneous scheduler dispatches instructions for execution on one or more plurality of heterogeneous processing elements, the instructions corresponding to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, wherein the instructions are native instructions to at least one of the one or more of the plurality of heterogeneous processing elements.
Abstract:
Technologies for optimized binary translation include a computing device that determines a cost-benefit metric associated with each translated code block of a translation cache. The cost-benefit metric is indicative of translation cost and performance benefit associated with the translated code block. The translation cost may be determined by measuring translation time of the translated code block. The cost-benefit metric may be calculated using a weighted cost-benefit function based on an expected workload of the computing device. In response to determining to free space in the translation cache, the computing device determines whether to discard each translated code block as a function of the cost-benefit metric. In response to determining to free space in the translation cache, the computing device may increment an iteration count and skip each translated code block if the iteration count modulo the corresponding cost-benefit metric is non-zero. Other embodiments are described and claimed.
Abstract:
A processor includes a memory to store original code and a fingerprint data structure, which stores, in a way thereof, an entry including a physical address for a page and a stored fingerprint generated from the page of the original code. A core includes a translation protection data structure (TPDS) to detect modification to the page, wherein the core is to, upon execution of a translation check instruction included within a translated page code corresponding to the page, transmit, to the TPDS, a modification check request having the physical address of the page in the memory and the way of the fingerprint data structure. A hardware TPDS miss handler is coupled to the core and is to process a miss request received from the TPDS responsive to the physical address not being present in the TPDS.