Abstract:
A data processing system (DPS) supports control-flow integrity (CFI). The DPS comprises a processing element with a CFI enforcement mechanism that supports one or more CFI instructions. The DPS also comprises at least one machine-accessible medium responsive to the processing element. Managed code in the machine-accessible medium is configured (a) to execute in a managed runtime environment (MRE) in the data processing system, and (b) to transfer control out from the MRE to unmanaged code, in response to a transfer control statement in the managed code. The machine-accessible medium also comprises a binary translator which, when executed, converts unmanaged code in the data processing system into hardened unmanaged code (HUC) by including CFI features in the HUC. The CFI features comprise one or more CFI instructions to utilize the CFI enforcement mechanism of the processing element for transfers of control initiated by the HUC. Other embodiments are described and claimed.
Abstract:
Technologies for optimized binary translation include a computing device that determines a cost-benefit metric associated with each translated code block of a translation cache. The cost-benefit metric is indicative of translation cost and performance benefit associated with the translated code block. The translation cost may be determined by measuring translation time of the translated code block. The cost-benefit metric may be calculated using a weighted cost-benefit function based on an expected workload of the computing device. In response to determining to free space in the translation cache, the computing device determines whether to discard each translated code block as a function of the cost-benefit metric. When freeing space, the computing device may also increment an iteration count and skip each translated code block if the iteration count modulo the corresponding cost-benefit metric is non-zero. Other embodiments are described and claimed.
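The modulo-based skip rule described above can be sketched as follows. This is a minimal illustration, assuming each block carries a positive integer metric; the names `TranslatedBlock`, `cost_benefit`, and `free_space`, and the particular weighting formula, are hypothetical and not taken from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TranslatedBlock:
    name: str
    metric: int  # assumed: positive integer cost-benefit metric


def cost_benefit(translation_time: float, benefit: float,
                 w_cost: float = 1.0, w_benefit: float = 1.0) -> int:
    """Hypothetical weighted cost-benefit function.

    The weights stand in for tuning based on the expected workload.
    """
    return max(1, round(w_cost * translation_time + w_benefit * benefit))


def free_space(cache: List[TranslatedBlock],
               iteration_count: int) -> Tuple[List[TranslatedBlock], int]:
    """Run one eviction pass over the translation cache.

    The iteration count is incremented first. A block is skipped (kept)
    when the count modulo its metric is non-zero, and discarded otherwise.
    """
    iteration_count += 1
    kept = [b for b in cache if iteration_count % b.metric != 0]
    return kept, iteration_count
```

Under this rule a block with metric m is discarded on the first pass whose iteration count is a multiple of m, so blocks with larger metrics tend to survive more passes.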
Abstract:
A vector reduction instruction is executed by a processor to provide efficient reduction operations on an array of data elements. The processor includes vector registers. Each vector register is divided into a plurality of lanes, and each lane stores the same number of data elements. The processor also includes execution circuitry that receives the vector reduction instruction to reduce the array of data elements stored in a source operand into a result in a destination operand using a reduction operator. Each of the source operand and the destination operand is one of the vector registers. Responsive to the vector reduction instruction, the execution circuitry applies the reduction operator to two of the data elements in each lane, and shifts one or more remaining data elements when there is at least one of the data elements remaining in each lane.
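The per-lane reduce-and-shift step can be modeled in software as a rough sketch. It assumes that the operator combines the two lowest elements of each lane, that the remaining elements shift down by one position, and that the vacated slot is filled with a placeholder; those details, and the names `reduce_step` and `reduce_lanes`, are assumptions rather than the instruction's defined semantics.

```python
from typing import Callable, List, Tuple

Lanes = List[List[int]]


def reduce_step(lanes: Lanes, op: Callable[[int, int], int],
                active: int, fill: int = 0) -> Tuple[Lanes, int]:
    """Model one execution of the reduction instruction.

    `active` is the number of still-valid elements per lane (assumed
    bookkeeping). Each lane combines its first two elements, then the
    remaining elements shift down one slot.
    """
    if active < 2:
        return [list(lane) for lane in lanes], active
    out = []
    for lane in lanes:
        combined = op(lane[0], lane[1])
        # Remaining elements move down; the freed top slot gets `fill`.
        out.append([combined] + list(lane[2:]) + [fill])
    return out, active - 1


def reduce_lanes(lanes: Lanes, op: Callable[[int, int], int]) -> List[int]:
    """Repeat the step until each lane holds a single reduced value."""
    active = len(lanes[0])
    while active > 1:
        lanes, active = reduce_step(lanes, op, active)
    return [lane[0] for lane in lanes]
```

For example, `reduce_lanes([[1, 2, 3, 4], [5, 6, 7, 8]], lambda a, b: a + b)` reduces each four-element lane in three steps.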
Abstract:
Systems, apparatuses, and methods for improving transactional memory (TM) throughput using a TM region indicator (or color) are described. Through the use of TM region indicators, younger TM regions can have their instructions retired while waiting for older TM regions to commit.
Abstract:
State recovery methods and apparatus for computing platforms are disclosed. An example method includes inserting a first instruction into optimized code to cause a first portion of a register in a first state to be saved to memory before execution of a region of the optimized code; and maintaining a value indicative of a manner in which a second portion of the register in the first state is to be restored in connection with a state recovery from the optimized code.
Abstract:
A micro-architecture may provide hardware and software co-designed dynamic binary translation. The micro-architecture may invoke a method to perform a dynamic binary translation. The method may comprise executing original software code compiled for a first instruction set, using processor hardware to detect a hot spot in the software code and passing control to a binary translator, determining a hot spot region for translation, generating the translated code using a second instruction set, placing the translated code in a translation cache, executing the translated code from the translation cache, and transitioning back to the original software code after the translated code finishes execution.
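The control flow of the method can be illustrated with a small software model. This is only a sketch: a software execution counter stands in for the hardware hot-spot detector, `translate` is a placeholder that wraps the original block instead of emitting code for a second instruction set, and the class name `DbtSketch` and the threshold are hypothetical.

```python
from typing import Callable, Dict, Tuple

Block = Callable[[], int]


class DbtSketch:
    def __init__(self, hot_threshold: int = 3):
        self.hot_threshold = hot_threshold            # assumed hot-spot threshold
        self.exec_counts: Dict[int, int] = {}         # stand-in for hardware detection
        self.translation_cache: Dict[int, Block] = {}

    def translate(self, block: Block) -> Block:
        # Placeholder: a real translator would generate code in the
        # second instruction set for the hot spot region.
        return lambda: block()

    def execute(self, addr: int, block: Block) -> Tuple[int, str]:
        """Run the block at `addr`, preferring the translation cache."""
        translated = self.translation_cache.get(addr)
        if translated is not None:
            # After this returns, control is back with the original code.
            return translated(), "translated"
        self.exec_counts[addr] = self.exec_counts.get(addr, 0) + 1
        if self.exec_counts[addr] >= self.hot_threshold:
            # Hot spot detected: translate and place in the cache.
            self.translation_cache[addr] = self.translate(block)
        return block(), "original"
```

With a threshold of 2, the first two calls for an address run the original block and the third runs the cached translation.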
Abstract:
A hardware profiling mechanism implemented by performance monitoring hardware enables page level automatic binary translation. The hardware during runtime identifies a code page in memory containing potentially optimizable instructions. The hardware requests allocation of a new page in memory associated with the code page, where the new page contains a collection of counters and each of the counters corresponds to one of the instructions in the code page. When the hardware detects a branch instruction having a branch target within the code page, it increments one of the counters that has the same position in the new page as the branch target in the code page. The execution of the code page is repeated and the counters are incremented when branch targets fall within the code page. The hardware then provides the counter values in the new page to a binary translator for binary translation.
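The position-matched counter scheme can be sketched in software. The model below assumes one counter per byte offset of the code page and a 4096-byte page; the class `PageProfiler`, its methods, and the threshold query are hypothetical illustrations of how a binary translator might consume the counters, not the hardware design itself.

```python
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size


class PageProfiler:
    def __init__(self, code_page_base: int, page_size: int = PAGE_SIZE):
        self.base = code_page_base
        self.size = page_size
        # The "new page": counter i corresponds to offset i in the code page.
        self.counters: List[int] = [0] * page_size

    def on_branch(self, target: int) -> bool:
        """Record a branch; count it only if the target lies in the code page."""
        offset = target - self.base
        if 0 <= offset < self.size:
            self.counters[offset] += 1
            return True
        return False

    def hot_targets(self, threshold: int) -> Dict[int, int]:
        """Return branch targets whose counters reached `threshold`."""
        return {self.base + i: c
                for i, c in enumerate(self.counters) if c >= threshold}
```

Because the counter index equals the target's offset within the code page, no lookup table is needed to map a counter back to its instruction.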
Abstract:
State recovery methods and apparatus for computing platforms are disclosed. An example method includes inserting, with a processor, a first instruction into optimized code to cause a first portion of a register in a first state to be saved to memory before execution of a region of the optimized code, maintaining, with the processor, a first indication of a first manner in which the first portion of the register is to be restored in connection with a state recovery after execution of the region of the optimized code, and maintaining, with the processor, a second indication of a second manner in which a second portion of the register is to be restored in connection with the state recovery after execution of the region of the optimized code.
Abstract:
Embodiments of an invention for a load instruction for code conversion are disclosed. In one embodiment, a processor includes an instruction unit and an execution unit. The instruction unit is to receive an instruction having a source operand to indicate a source location and a destination operand to indicate a destination location. The execution unit is to execute the instruction. Execution of the instruction includes checking the access permissions of the source location and loading content from the source location into the destination location if the access permissions of the source location indicate that the content is executable.
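The permission-checked load can be modeled as a short sketch. The abstract does not say what happens when the source is not executable, so raising an exception here is an assumption, as are the names `load_if_executable` and `AccessFault` and the string encoding of permissions.

```python
from typing import Dict


class AccessFault(Exception):
    """Assumed outcome when the source location is not executable."""


def load_if_executable(memory: Dict[int, int], permissions: Dict[int, str],
                       src: int, regs: Dict[str, int], dst: str) -> None:
    """Load memory[src] into regs[dst] only if src is marked executable.

    Permissions are modeled as strings such as "rx" or "rw" (hypothetical).
    """
    if "x" not in permissions.get(src, ""):
        raise AccessFault(f"source {src:#x} is not executable")
    regs[dst] = memory[src]
```

A binary translator could use such a load to read code it intends to convert while never reading from pages that are not marked executable.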
Abstract:
A processor includes a decode unit to decode a return target restrictive return from procedure (RTR return) instruction. A return target restriction unit is responsive to the RTR return instruction to determine whether to restrict an attempt by the RTR return instruction to make a control flow transfer to an instruction at a return address corresponding to the RTR return instruction. The determination is based on compatibility of a type of the instruction at the return address with the RTR return instruction and based on compatibility of first return target restrictive information (RTR information) of the RTR return instruction with second RTR information of the instruction at the return address. A control flow transfer unit is responsive to the RTR return instruction to transfer control flow to the instruction at the return address when the return target restriction unit determines not to restrict the attempt.
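The two-part compatibility check can be illustrated with a small model. It assumes that a compatible instruction type is a dedicated marker opcode at the return address and that RTR information is compatible when the two values are equal; both rules, and the names `Instr`, `RTR_TARGET`, and `rtr_return`, are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class ControlFlowViolation(Exception):
    """Raised when the return attempt is restricted."""


@dataclass(frozen=True)
class Instr:
    opcode: str
    rtr_info: Optional[int] = None


RTR_TARGET = "rtr_target"  # assumed marker opcode for a legal return target


def rtr_return(program: Dict[int, Instr], ret: Instr,
               return_address: int) -> int:
    """Return the new instruction pointer, or restrict the transfer."""
    target = program.get(return_address)
    # Check 1: the instruction at the return address has a compatible type.
    type_ok = target is not None and target.opcode == RTR_TARGET
    # Check 2: its RTR information matches that of the return instruction.
    info_ok = type_ok and target.rtr_info == ret.rtr_info
    if not (type_ok and info_ok):
        raise ControlFlowViolation(f"restricted return to {return_address:#x}")
    return return_address
```

A return whose address has been overwritten to point at an arbitrary instruction fails the first check, and a return to a marker carrying different RTR information fails the second.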