Abstract:
A reader of a set of data accessors that includes readers and writers detects that a particular lock of a first collection of non-global locks associated with a data object of a computing environment is held by another accessor. After checking a blocking indicator, the reader uses a second lock (which is not part of the first collection) to obtain read access to the data object and implements its reads without acquiring the particular lock. Prior to implementing a write on the data object, a writer acquires at least some locks of the first collection, and sets the blocking indicator to prevent readers from using the second lock to obtain read access to the data object.
Abstract:
In shared-memory computer systems, threads may communicate with one another using shared memory. A receiving thread may poll a message target location repeatedly to detect the delivery of a message. Such polling may cause excessive cache coherency traffic and/or congestion on various system buses and/or other interconnects. A method for inter-processor communication may reduce such bus traffic by reducing the number of reads performed and/or the number of cache coherency messages necessary to pass messages. The method may include a thread reading the value of a message target location once, and determining that this value has been modified by detecting inter-processor messages, such as cache coherence messages, indicative of such modification. In systems that support transactional memory, a thread may use transactional memory primitives to detect the cache coherence messages. This may be done by starting a transaction, reading the target memory location, and spinning until the transaction is aborted.
Abstract:
A data object has a lock and a condition indicator associated with it. Based at least partly on detecting a first setting of the condition indicator, a reader stores an indication that the reader has obtained read access to the data object in an element of a readers structure and reads the data object without acquiring the lock. A writer detects the first setting and replaces it with a second setting, indicating that the lock is to be acquired by readers before reading the data object. Prior to performing a write on the data object, the writer verifies that one or more elements of the readers structure have been cleared.
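The reader and writer paths described above can be sketched roughly as follows. This is a toy model, not the claimed implementation: the class name `IndicatorGuardedObject`, the fixed number of reader slots, the reader's second check of the indicator, and the writer restoring the first setting after its write are all assumptions added for illustration. It also relies on CPython's global interpreter lock for memory ordering; a real implementation would need explicit atomics and fences.

```python
import threading
import time


class IndicatorGuardedObject:
    """Toy model: readers skip the lock while the indicator allows it."""

    def __init__(self, value, reader_slots=4):
        self._lock = threading.Lock()
        self._lock_free_reads = True             # condition indicator, "first setting"
        self._readers = [False] * reader_slots   # readers structure, one element per reader
        self._value = value

    def read(self, slot):
        if self._lock_free_reads:
            self._readers[slot] = True           # record that this reader has read access
            try:
                # Assumed re-check: a writer may have changed the indicator
                # between the first test and the store above.
                if self._lock_free_reads:
                    return self._value           # read without acquiring the lock
            finally:
                self._readers[slot] = False      # clear this reader's element
        with self._lock:                         # "second setting": lock is required
            return self._value

    def write(self, value):
        with self._lock:
            self._lock_free_reads = False        # replace the first setting with the second
            while any(self._readers):            # verify that reader elements are cleared
                time.sleep(0)                    # yield to other threads while waiting
            self._value = value
            self._lock_free_reads = True         # assumption: restore the first setting
```

Each reader thread is assumed to own one slot index, so `obj.read(slot)` never contends with another reader on the same element.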
Abstract:
Concurrent threads may be synchronized at the level of the memory words they access rather than at the level of the lock that protects the execution of critical sections. Each lock may be associated with an array of flags and each flag may indicate ownership of certain memory words. A pessimistic thread may set flags corresponding to memory words it is accessing in the critical section, while an optimistic thread may read the corresponding flag before any memory access to ensure that the flag is not set and that therefore the associated memory word is not being accessed by the other thread. Thus, optimistic threads that do not have conflicts with the pessimistic thread may not have to wait for the pessimistic thread to release the lock before proceeding.
Abstract:
A concurrency-restricting lock may divide a set of threads waiting to acquire the lock into an active circulating set (ACS) that contends for the lock, and a passive set (PS) that awaits an opportunity to contend for the lock. The lock, which may include multiple constituent lock types, lists, or queues, may be unfair over the short term, but improve throughput of the underlying multithreaded application. Culling and long-term fairness policies may be applied to the lock to move excess threads from the ACS to the PS or promote threads from the PS to the ACS. These policies may constrain the size or distribution of threads in the ACS (which may be NUMA-aware). A waiting policy may avoid aggressive promotion from the PS to the ACS, and a short-term fairness policy may move a thread from the tail of a list or queue to its head.
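The split between the ACS and the PS can be illustrated with a small single-threaded bookkeeping model. It performs no real locking and only tracks which waiting thread would be granted the lock next. The class name `AcsPsScheduler` and the parameters `max_active` and `promote_every` are hypothetical, and the specific culling and promotion rules are assumptions chosen for brevity.

```python
from collections import deque


class AcsPsScheduler:
    """Toy bookkeeping for an active circulating set (ACS) and a passive set (PS)."""

    def __init__(self, max_active=2, promote_every=4):
        self.acs = deque()                  # threads allowed to contend for the lock
        self.ps = deque()                   # threads waiting for a chance to contend
        self.max_active = max_active        # assumed cap on the size of the ACS
        self.promote_every = promote_every  # assumed long-term fairness period
        self._handoffs = 0

    def arrive(self, thread_id):
        """A thread starts waiting: join the ACS if there is room, else the PS."""
        if len(self.acs) < self.max_active:
            self.acs.append(thread_id)
        else:
            self.ps.append(thread_id)       # culled: excess threads wait passively

    def next_owner(self):
        """Return the thread that is granted the lock next, or None if none waits."""
        self._handoffs += 1
        fairness_tick = self._handoffs % self.promote_every == 0
        if self.ps and (not self.acs or fairness_tick):
            promoted = self.ps.popleft()    # promote the longest-waiting passive thread
            if len(self.acs) >= self.max_active:
                self.ps.append(self.acs.pop())  # cull the ACS tail to keep the cap
            self.acs.appendleft(promoted)   # the promoted thread goes to the head
        if not self.acs:
            return None
        return self.acs.popleft()
```

In this model a thread that releases the lock and wants it again calls `arrive` once more, which is what keeps the same few threads circulating between fairness ticks.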
Abstract:
When performing non-sequential accesses to large data sets, hot spots may be avoided by permuting the memory locations being accessed to more evenly spread those accesses across the memory and across multiple memory channels. A permutation step may be used when accessing data, such as to improve the distribution of those memory accesses within the system. Instead of accessing one memory address, that address may be permuted so that another memory address is accessed. Non-sequential accesses to an array may be modified such that each index to the array is permuted to another index in the array. Collisions between pre- and post-translation addresses may be prevented and one-to-one mappings may be used. Permutation mechanisms may be implemented in software, hardware, or a combination of both, with or without the knowledge of the process performing the memory accesses.
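A software version of the index permutation might look like the sketch below. The particular mapping (an odd multiplier followed by an xor-shift) is an assumption chosen because each step is invertible on a power-of-two range, which gives the one-to-one property; the abstract does not specify a permutation function, and the name `permute_index` and the default multiplier are hypothetical.

```python
def permute_index(i, bits, multiplier=0x9E3779B1):
    """Map an index in [0, 2**bits) to another index in the same range, one-to-one."""
    size = 1 << bits
    if not 0 <= i < size:
        raise IndexError("index out of range")
    if multiplier % 2 == 0:
        raise ValueError("multiplier must be odd to be invertible modulo 2**bits")
    x = (i * multiplier) & (size - 1)   # odd multiplier: a bijection modulo 2**bits
    x ^= x >> max(1, bits // 2)         # xor with a right shift: also a bijection
    return x


def permuted_get(array, i, bits):
    """Read logical element i from an array stored in permuted order."""
    return array[permute_index(i, bits)]


def permuted_set(array, i, bits, value):
    """Write logical element i to an array stored in permuted order."""
    array[permute_index(i, bits)] = value
```

Because the mapping is a bijection, code that reads and writes only through `permuted_get` and `permuted_set` sees the same logical array while the physical locations it touches are rearranged.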
Abstract:
The systems and methods described herein may implement probabilistic counters and/or update mechanisms for those counters such that they are dependent on the value of a configurable accuracy parameter. The accuracy parameter value may be adjusted to provide fine-grained control over the tradeoff between the accuracy of the counters and the performance of applications that access them. The counters may be implemented as data structures that include a mantissa portion and an exponent portion that collectively represent an update probability value. When updating the counters, the value of the configurable accuracy parameter may affect whether, when, how often, or by what amount the mantissa portion and/or the exponent portion are updated. Updating a probabilistic counter may include multiplying its value by a constant that is dependent on the value of a configurable accuracy parameter. The counters may be accessible within transactions. The counters may have deterministic update policies.
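One way to picture a mantissa/exponent counter is the sketch below. It is an illustration under assumptions, not the method claimed above: the update rule (increment the mantissa with probability 2^-exponent, and halve the mantissa while bumping the exponent when it reaches a limit) and the names `ProbabilisticCounter` and `mantissa_limit` are hypothetical. Here `mantissa_limit` plays the role of the accuracy parameter: a larger limit means more exact updates and a smaller relative error.

```python
import random


class ProbabilisticCounter:
    """Toy counter whose estimate is mantissa * 2**exponent."""

    def __init__(self, mantissa_limit=1024, rng=None):
        if mantissa_limit < 2 or mantissa_limit % 2:
            raise ValueError("mantissa_limit must be an even number >= 2")
        self.mantissa = 0
        self.exponent = 0
        self.mantissa_limit = mantissa_limit    # assumed stand-in for the accuracy parameter
        self._rng = rng if rng is not None else random.Random()

    def increment(self):
        # Update with probability 2**-exponent; each update is worth 2**exponent,
        # so the expected change in the estimate is exactly one.
        if self.exponent == 0 or self._rng.getrandbits(self.exponent) == 0:
            self.mantissa += 1
            if self.mantissa == self.mantissa_limit:
                self.mantissa //= 2             # halving the mantissa and bumping the
                self.exponent += 1              # exponent leaves the estimate unchanged

    def estimate(self):
        return self.mantissa << self.exponent
```

While the exponent is zero the counter is exact; after that, most increments return without writing the counter at all, which is where the performance benefit would come from on a contended shared counter.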
Abstract:
Transactional Lock Elision allows hardware transactions to execute unmodified critical sections protected by the same lock concurrently, by subscribing to the lock and verifying that it is available before committing the transaction. A “lazy subscription” optimization, which delays lock subscription, can potentially cause behavior that cannot occur when the critical sections are executed under the lock. Hardware extensions may provide mechanisms to ensure that lazy subscriptions are safe (e.g., that they result in correct behavior). Prior to executing a critical section transactionally, its lock and subscription code may be identified (e.g., by writing their locations to special registers). Prior to committing the transaction, the thread executing the critical section may verify that the correct lock was correctly subscribed to. If not, or if locations identified by the special registers have been modified, the transaction may be aborted. Nested critical sections associated with different lock types may invoke different subscription code.
Abstract:
The present embodiments provide a system for supporting targeted stores in a shared-memory multiprocessor. A targeted store enables a first processor to push a cache line to be stored in a cache memory of a second processor in the shared-memory multiprocessor. This eliminates the need for multiple cache-coherence operations to transfer the cache line from the first processor to the second processor. The system includes an interface, such as an application programming interface (API), and a system call interface or an instruction-set architecture (ISA) that provides access to a number of mechanisms for supporting targeted stores. These mechanisms include a thread-location mechanism that determines a location near where a thread is executing in the shared-memory multiprocessor, and a targeted-store mechanism that targets a store to a location (e.g., cache memory) in the shared-memory multiprocessor.
Abstract:
The systems and methods described herein may be used to implement scalable statistics counters suitable for use in systems that employ a NUMA-style memory architecture. The counters may be implemented as data structures that include a count value portion and a node identifier portion. The counters may be accessible within transactions. The node identifier portion may identify a node on which a thread that most recently incremented the counter was executing or one on which a thread that has requested priority to increment the shared counter was executing. Threads executing on identified nodes may have higher priority to increment the counter than other threads. Threads executing on other nodes may delay their attempts to increment the counter, thus encouraging consecutive updates from threads on a single node. Impatient threads may attempt to update the node identifier portion or may update an anti-starvation variable to indicate a request for priority.
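The count-plus-node-identifier idea can be sketched as follows. This is a simplified model under assumptions: a `threading.Lock` stands in for an atomic compare-and-swap on a single word, the caller passes its node number explicitly, and the anti-starvation variable is omitted. The names `NodeAwareCounter` and `remote_delay` are hypothetical.

```python
import threading
import time


class NodeAwareCounter:
    """Toy counter holding a count and the node of the most recent incrementer."""

    def __init__(self, remote_delay=0.0001):
        self._lock = threading.Lock()     # stands in for an atomic update of one word
        self._count = 0                   # count value portion
        self._node = None                 # node identifier portion
        self._remote_delay = remote_delay

    def increment(self, node):
        # Unsynchronized hint read: threads on a different node than the last
        # incrementer delay briefly, which encourages runs of same-node updates.
        last = self._node
        if last is not None and last != node:
            time.sleep(self._remote_delay)
        with self._lock:
            self._count += 1
            self._node = node             # this node now has priority

    def value(self):
        with self._lock:
            return self._count

    def last_node(self):
        with self._lock:
            return self._node
```

The delay changes only the order in which increments land; every increment is still applied exactly once, so the final count is exact.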