Abstract:
The system described herein may track references to a shared object by concurrently executing threads using a reference tracking data structure that includes an owner field and an array of byte-addressable per-thread entries, each including a per-thread reference counter and a per-thread counter lock. Slotted threads assigned to a given array entry may increment or decrement the per-thread reference counter in that entry in response to referencing or dereferencing the shared object. Unslotted threads may increment or decrement a shared unslotted reference counter. A thread may update the data structure and/or examine it to determine whether the number of references to the shared object is zero or non-zero using a blocking-optimistic or a non-blocking mechanism. A checking thread may acquire ownership of the data structure, obtain an instantaneous snapshot of all counters, and return a value indicating whether the number of references to the shared object is zero or non-zero.
Abstract:
NUMA-aware reader-writer locks may leverage lock cohorting techniques to band together writer requests from a single NUMA node. The locks may relax the order in which the lock schedules the execution of critical sections of code by reader threads and writer threads, allowing lock ownership to remain resident on a single NUMA node for long periods, while also taking advantage of parallelism between reader threads. Threads may contend on node-level structures to get permission to acquire a globally shared reader-writer lock. Writer threads may follow a lock cohorting strategy of passing ownership of the lock in write mode from one thread to a cohort writer thread without releasing the shared lock, while reader threads from multiple NUMA nodes may simultaneously acquire the shared lock in read mode. The reader-writer lock may follow a writer-preference policy, a reader-preference policy or a hybrid policy.
Abstract:
Systems and methods for optimizing code may use transactional memory to optimize one code section by forcing another code section to execute atomically. Application source code may be analyzed to identify instructions in one code section that only need to be executed if there exists the possibility that another code section (e.g., a critical section) could be partially executed or that its results could be affected by interference. In response to identifying such instructions, alternate code may be generated that forces the critical section to be executed as an atomic transaction, e.g., using best-effort hardware transactional memory. This alternate code may replace the original code or may be included in an alternate execution path that can be conditionally selected for execution at runtime. The alternate code may elide the identified instructions (which are rendered unnecessary by the transaction) by removing them, or by including them in the alternate execution path.
Abstract:
A computer system may recognize a busy-wait loop in program instructions at compile time and/or may recognize busy-wait looping behavior during execution of program instructions. The system may recognize that an exit condition for a busy-wait loop is specified by a conditional branch type instruction in the program instructions. In response to identifying the loop and the conditional branch type instruction that specifies its exit condition, the system may influence or override a prediction made by a dynamic branch predictor, resulting in a prediction that the exit condition will be met and that the loop will be exited regardless of any observed branch behavior for the conditional branch type instruction. The looping instructions may implement waiting for an inter-thread communication event to occur or for a lock to become available. When the exit condition is met, the loop may be exited without incurring a misprediction delay.
Abstract:
The system and methods described herein may be used to implement a scalable, hierarchal, queue-based lock using flat combining. A thread executing on a processor core in a cluster of cores that share a memory may post a request to acquire a shared lock in a node of a publication list for the cluster using a non-atomic operation. A combiner thread may build an ordered (logical) local request queue that includes its own node and nodes of other threads (in the cluster) that include lock requests. The combiner thread may splice the local request queue into a (logical) global request queue for the shared lock as a sub-queue. A thread whose request has been posted in a node that has been combined into a local sub-queue and spliced into the global request queue may spin on a lock ownership indicator in its node until it is granted the shared lock.
Abstract:
Techniques for controlling a thread on a computerized system having multiple processors involve accessing state information of a blocked thread, and maintaining the state information of the blocked thread at current values when the state information indicates that less than a predetermined amount of time has elapsed since the blocked thread ran on the computerized system. Such techniques further involve setting the state information of the blocked thread to identify affinity for a particular processor of the multiple processors when the state information indicates that at least the predetermined amount of time has elapsed since the blocked thread ran on the computerized system. Such operation enables the system to place a cold blocked thread which shares data with another thread on the same processor of that other thread so that, when the blocked thread awakens and runs, that thread is closer to the shared data.
Abstract:
Transactional Lock Elision (TLE) may allow threads in a multi-threaded system to concurrently execute critical sections as speculative transactions. Such speculative transactions may abort due to contention among threads. Systems and methods for managing contention among threads may increase overall performance by considering both local and global execution data in reducing, resolving, and/or mitigating such contention. Global data may include aggregated and/or derived data representing thread-local data of remote thread(s), including transactional abort history, abort causal history, resource consumption history, performance history, synchronization history, and/or transactional delay history. Local and/or global data may be used in determining the mode by which critical sections are executed, including TLE and mutual exclusion, and/or to inform concurrency throttling mechanisms. Local and/or global data may also be used in determining concurrency throttling parameters (e.g., delay intervals) used in delaying a thread when attempting to execute a transaction and/or when retrying a previously aborted transaction.
Abstract:
Apparatus, methods, and program products are disclosed that provide a technology that implicitly isolates a portion of a transactional memory that is shared between multiple threads for exclusive use by an isolating thread without the possibility of other transactions modifying the isolated portion of the transactional memory. In some of the described embodiments read locations of a shared memory are covered by a first set of lock objects, and write locations are covered by a second set of lock objects, each lock object in each set having a reader mode and a writer mode. Some of these embodiments acquiring each lock object in the first set using the reader mode, and acquire each lock object in the second set using the writer mode. These embodiments store result data values at write locations in the shared memory subsequent to the acquiring said first and second set of lock objects.
Abstract:
Techniques for controlling a thread on a computerized system having multiple processors involve accessing state information of a blocked thread, and maintaining the state information of the blocked thread at current values when the state information indicates that less than a predetermined amount of time has elapsed since the blocked thread ran on the computerized system. Such techniques further involve setting the state information of the blocked thread to identify affinity for a particular processor of the multiple processors when the state information indicates that at least the predetermined amount of time has elapsed since the blocked thread ran on the computerized system. Such operation enables the system to place a cold blocked thread which shares data with another thread on the same processor of that other thread so that, when the blocked thread awakens and runs, that thread is closer to the shared data.
Abstract:
Transactional Lock Elision (TLE) may allow threads in a multi-threaded system to concurrently execute critical sections as speculative transactions. Such speculative transactions may abort due to contention among threads. Systems and methods for managing contention among threads may increase overall performance by considering both local and global execution data in reducing, resolving, and/or mitigating such contention. Global data may include aggregated and/or derived data representing thread-local data of remote thread(s), including transactional abort history, abort causal history, resource consumption history, performance history, synchronization history, and/or transactional delay history. Local and/or global data may be used in determining the mode by which critical sections are executed, including TLE and mutual exclusion, and/or to inform concurrency throttling mechanisms. Local and/or global data may also be used in determining concurrency throttling parameters (e.g., delay intervals) used in delaying a thread when attempting to execute a transaction and/or when retrying a previously aborted transaction.