摘要:
A memory race recorder (MRR) is provided. The MRR includes a multi-core processor having a relaxed memory consistency model, an extension to the multi-core processor, the extension to store chunks, the chunk having a chunk size (CS) and an instruction count (IC), and a plurality of cores to execute instructions. The plurality of cores executes load/store instructions to/from a store buffer (STB) and a simulated memory to store the value when the value is not in the STB. The oldest value in the STB is transferred to the simulated memory when the IC is equal to zero and the CS is greater than zero. The MRR logs a trace entry comprising the CS, the IC, and a global timestamp, the global timestamp proving a total order across all logged chunks.
摘要:
An apparatus and method are described for a hardware transactional memory (HTM) profiler. For example, one embodiment of an apparatus comprises a transactional debugger (TDB) recording module to record data related to the execution of transactional memory program code, including data related to the execution of branches and transactional events in the transactional memory program code; and a profiler to analyze portions of the recorded data using trace-based replay techniques to responsively generate profile data comprising transaction-level events and function-level conflict data usable to optimize the transactional memory program code.
摘要:
A system is disclosed that includes a processor and a dynamic random access memory (DRAM). The processor includes a hybrid transactional memory (HyTM) that includes hardware transactional memory (HTM), and a program debugger to replay a program that includes an HTM instruction and that has been executed has been executed using the HyTM. The program debugger includes a software emulator that is to replay the HTM instruction by emulation of the HTM. Other embodiments are disclosed and claimed.
摘要:
A processing device implementing unbounded transactional memory with forward progress guarantees using a hardware global lock is disclosed. A processing device of the disclosure includes a hardware transactional memory (HTM) hardware contention manager to cause a bounded transaction to be translated to an unbounded transaction, the unbounded transaction to acquire a global hardware lock for the unbounded transaction, the global hardware lock read by bounded transactions that abort when the global hardware lock is taken. The processing device further includes an execution unit communicably coupled to the HTM hardware contention manager to execute instructions of the unbounded transaction without speculation, the unbounded transaction to release the global hardware lock upon completion of execution of the instructions.
摘要:
A processing device implementing unbounded transactional memory with forward progress guarantees using a hardware global lock is disclosed. A processing device of the disclosure includes a hardware transactional memory (HTM) hardware contention manager to cause a bounded transaction to be translated to an unbounded transaction, the unbounded transaction to acquire a global hardware lock for the unbounded transaction, the global hardware lock read by bounded transactions that abort when the global hardware lock is taken. The processing device further includes an execution unit communicably coupled to the HTM hardware contention manager to execute instructions of the unbounded transaction without speculation, the unbounded transaction to release the global hardware lock upon completion of execution of the instructions.
摘要:
A processor includes a front end to decode an instruction and pass the instruction to execution units with branch suffix information. The processor further includes execution units to execute the instruction and a retirement unit to retire the instruction. The instruction is to specify an operation to be conditionally executed based upon a branch suffix to identify previous execution. The processor further includes logic to, upon retirement of the instruction, determine the result of a series of branch operations preceding execution of the instruction, compare the result to the branch suffix information, allow execution and retirement of the instruction based on a determination that the result matches the branch suffix information, and generate a fault based on a determination that the result does not match the branch suffix information.
摘要:
Technologies for optimizing sparse matrix code include a target computing device having a processor and a field-programmable gate array (FPGA). A compiler identifies a performance-critical loop in a sparse matrix source code and generates optimized executable code, including processor code and FPGA code. The target computing device executes the optimized executable code, using the processor for the processor code and the FPGA for the FPGA code. The processor executes a first iteration of the loop, generates reusable optimization data in response to executing the first iteration, and stores the reusable optimization data in a shared memory. The FPGA accesses the optimization data in the shared memory, executes additional iterations of the loop, and optimizes the additional iterations of the loop based on the optimization data. The optimization data may include, for example, loop-invariant data, reordered data, or alternate data storage representations. Other embodiments are described and claimed.