Abstract:
Methods, and media, and computer systems are provided. The method includes, the media includes control logic for, and the computer system includes a processor with control logic for overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup.
Abstract:
A data processing device is provided that facilitates cache coherence policies. In one embodiment, a data processing device utilizes invalidation tags in connection with a cache that is associated with a processing engine. In some embodiments, the cache is configured to store a plurality of cache entries where each cache entry includes a cache line configured to store data and a corresponding cache tag configured to store address information associated with data stored in the cache line. Such address information includes invalidation flags with respect to addresses stored in the cache tags. Each cache tag is associated with an invalidation tag configured to store information related to invalidation commands of addresses stored in the cache tag. In such embodiment, the cache is configured to set invalidation flags of cache tags based upon information stored in respective invalidation tags.
Abstract:
Methods, media, and computing systems are provided. The method includes, the media are configured for, and the computing system includes a processor with control logic for allocating memory for storing a plurality of local register states for work items to be executed in single instruction multiple data hardware and for repacking wavefronts that include work items associated with a program instruction responsive to a conditional statement. The repacking is configured to create repacked wavefronts that include at least one of a wavefront containing work items that all pass the conditional statement and a wavefront containing work items that all fail the conditional statement.
Abstract:
A technique for synchronizing workgroups is provided. Multiple workgroups execute a wait instruction that specifies a condition variable and a condition. A workgroup scheduler stops execution of a workgroup that executes a wait instruction and an advanced controller begins monitoring the condition variable. In response to the advanced controller detecting that the condition is met, the workgroup scheduler determines whether there is a high contention scenario, which occurs when the wait instruction is part of a mutual exclusion synchronization primitive and is detected by determining that there is a low number of updates to the condition variable prior to detecting that the condition has been met. In a high contention scenario, the workgroup scheduler wakes up one workgroup and schedules another workgroup to be woken up at a time in the future. In a non-contention scenario, more than one workgroup can be woken up at the same time.
Abstract:
A system, method, and computer program product are provided for a memory device system. One or more memory dies and at least one logic die are disposed in a package and communicatively coupled. The logic die comprises a processing device configurable to manage virtual memory and operate in an operating mode. The operating mode is selected from a set of operating modes comprising a slave operating mode and a host operating mode.
Abstract:
The described embodiments include a computing device with two or more types of processors and a memory that is shared between the two or more types of processors. The computing device performs operations for handling cache coherency between the two or more types of processors. During operation, the computing device sets a cache coherency indicator in metadata in a page table entry in a page table, the page table entry information about a page of data that is stored in the memory. The computing device then uses the cache coherency indicator to determine operations to be performed when accessing data in the page of data in the memory. For example, the computing device can use the coherency indicator to determine whether a coherency operation is to be performed when a processor of a given type accesses data in the page of data in the memory.
Abstract:
Described herein is an apparatus and method for remote scoped synchronization, which is a new semantic that allows a work-item to order memory accesses with a scope instance outside of its scope hierarchy. More precisely, remote synchronization expands visibility at a particular scope to all scope-instances encompassed by that scope. Remote scoped synchronization operation allows smaller scopes to be used more frequently and defers added cost to only when larger scoped synchronization is required. This enables programmers to optimize the scope that memory operations are performed at for important communication patterns like work stealing. Executing memory operations at the optimum scope reduces both execution time and energy. In particular, remote synchronization allows a work-item to communicate with a scope that it otherwise would not be able to access. Specifically, work-items can pull valid data from and push updates to scopes that do not (hierarchically) contain them.
Abstract:
The described embodiments comprise a selection mechanism that selects a resource from a set of resources in a computing device for performing an operation. In some embodiments, the selection mechanism performs a lookup in a table selected from a set of tables to identify a resource from the set of resources. When the resource is not available for performing the operation and until another resource is selected for performing the operation, the selection mechanism identifies a next resource in the table and selects the next resource for performing the operation when the next resource is available for performing the operation.
Abstract:
The described embodiments comprise a first hardware context. The first hardware context receives, from a second hardware context, an indication of a memory location and a condition to be met by the memory location. The first hardware context then sends a signal to the second hardware context when the memory location meets the condition.
Abstract:
A method, computer program product, and system is described that determines the correctness of using memory operations in a computing device with heterogeneous computer components. Embodiments include an optimizer based on the characteristics of a Sequential Consistency for Heterogeneous-Race-Free (SC for HRF) model that analyzes a program and determines the correctness of the ordering of events in the program. HRF models include combinations of the properties: scope order, scope inclusion, and scope transitivity. The optimizer can determine when a program is heterogeneous-race-free in accordance with an SC for HRF memory consistency model. For example, the optimizer can analyze a portion of program code, respect the properties of the SC for HRF model, and determine whether a value produced by a store memory event will be a candidate for a value observed by a load memory event. In addition, the optimizer can determine whether reordering of events is possible.
Abstract translation:描述了一种方法,计算机程序产品和系统,其确定在具有异构计算机组件的计算设备中使用存储器操作的正确性。 实施例包括基于用于异构无竞争(SC for HRF)的顺序一致性的特性的优化器,该模型分析程序并确定程序中的事件的顺序的正确性。 HRF模型包括属性的组合:范围顺序,范围包含和范围传递性。 优化器可以根据HR对HRF内存一致性模型的SC来确定程序何时是异构无竞争的。 例如,优化器可以分析程序代码的一部分,尊重SC的HRF模型的属性,并且确定由存储器存储器事件产生的值是否将是由加载存储器事件观察到的值的候选。 此外,优化器可以确定是否可能重新排序事件。