Abstract:
Various computing network messaging techniques and apparatus are disclosed. In one aspect, a method of computing is provided that includes executing a first thread and a second thread. A message is sent from the first thread to the second thread. The message includes a domain descriptor that identifies a first location of the first thread and a second location of the second thread.
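As a rough illustration of the message format described above, the following sketch attaches a domain descriptor to each message. The class names, the `(node, core)` location encoding, and the `same_node` helper are assumptions made for this example; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainDescriptor:
    # Locations are modeled as (node, core) tuples; this encoding is hypothetical.
    sender_location: tuple
    receiver_location: tuple


@dataclass(frozen=True)
class Message:
    descriptor: DomainDescriptor
    payload: bytes


def same_node(descriptor):
    """Return True when both threads are located on the same node."""
    return descriptor.sender_location[0] == descriptor.receiver_location[0]


# First thread on node 0, core 1; second thread on node 0, core 3.
msg = Message(DomainDescriptor((0, 1), (0, 3)), b"hello")
```

A runtime could use a check such as `same_node(msg.descriptor)` to choose between a local and a remote delivery path, although that policy is not stated in the abstract.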
Abstract:
A method of performing memory synchronization operations is provided that includes receiving, at a programmable cache controller in communication with one or more caches, an instruction in a first language to perform a memory synchronization operation of synchronizing a plurality of instruction sequences executing on a processor, mapping the received instruction in the first language to one or more selected cache operations in a second language executable by the cache controller and executing the one or more cache operations to perform the memory synchronization operation. The method further comprises receiving a second mapping that provides mapping instructions to map the received instruction to one or more other cache operations, mapping the received instruction to one or more other cache operations and executing the one or more other cache operations to perform the memory synchronization operation.
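The remappable instruction-to-operation table described above can be sketched as a small simulation. The instruction name `release` and the cache operation names are hypothetical placeholders, and the class below is a software model, not the hardware controller.

```python
class ProgrammableCacheController:
    """Toy model: maps a synchronization instruction to cache operations."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)
        self.executed = []  # record of cache operations, in execution order

    def load_mapping(self, mapping):
        # Receiving a second mapping replaces how instructions are translated.
        self.mapping = dict(mapping)

    def execute(self, instruction):
        ops = self.mapping[instruction]
        self.executed.extend(ops)
        return list(ops)


# Hypothetical first mapping: a release flushes the L1 cache.
controller = ProgrammableCacheController({"release": ["flush_l1"]})
first = controller.execute("release")

# Hypothetical second mapping: the same instruction now maps to other operations.
controller.load_mapping({"release": ["flush_l1", "flush_l2"]})
second = controller.execute("release")
```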
Abstract:
A processing system having a multilevel cache hierarchy employs techniques for repurposing dead cache blocks so as to use otherwise wasted space in a cache hierarchy employing a write-back scheme. For a cache line containing invalid data with a valid tag, the valid tag is maintained for cache coherence purposes or otherwise, resulting in a valid tag for a dead cache block. A cache controller repurposes the dead cache block by storing any of a variety of new data at the dead cache block, while storing the new tag in a tag entry of a dead block tag way with an identifier indicating the location of the new data.
Abstract:
A client device detects one or more servers to which an application can be offloaded. The client device receives information from the servers regarding their graphics processing unit (GPU) compute resources. The client device selects one of the servers to offload the application based on such factors as the GPU compute resources, other performance metrics, power, and bandwidth/latency/quality of the communication channel between the server and the client device. The client device sends host code and a GPU computation kernel in intermediate language format to the server. The server compiles the host code and GPU kernel code into suitable machine instruction set architecture code for execution on CPU(s) and GPU(s) of the server. Once the application execution is complete, the server returns the results of the execution to the client device.
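One way to picture the server selection step is a weighted score over the reported metrics. The abstract lists the factors but not how they are combined, so the linear score, the weights, and the field names below are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class ServerInfo:
    name: str
    gpu_tflops: float      # reported GPU compute resources
    latency_ms: float      # channel latency to the client
    bandwidth_mbps: float  # channel bandwidth to the client


def score(server, w_compute=1.0, w_latency=0.05, w_bandwidth=0.01):
    """Higher is better; the weights are illustrative, not from the source."""
    return (w_compute * server.gpu_tflops
            - w_latency * server.latency_ms
            + w_bandwidth * server.bandwidth_mbps)


def select_server(servers):
    """Pick the candidate server with the highest score."""
    return max(servers, key=score)


candidates = [
    ServerInfo("near", gpu_tflops=10.0, latency_ms=20.0, bandwidth_mbps=100.0),
    ServerInfo("far", gpu_tflops=12.0, latency_ms=200.0, bandwidth_mbps=50.0),
]
chosen = select_server(candidates)
```

With these weights the slightly slower but much closer server is selected, which shows how channel quality can outweigh raw GPU throughput.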
Abstract:
A method of message-based communication is provided which includes executing, on one or more accelerated processing units, a plurality of groups of work items, receiving a first message from a first group of work items of the plurality of groups of work items executing on the one or more accelerated processing units, and storing the first message at a first segment of memory allocated to a second group of work items of the plurality of groups of work items executing on the one or more accelerated processing units.
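The storing step can be modeled in software as one mailbox segment per group of work items. This is a host-side sketch only: the `Mailboxes` class, the integer group IDs, and the FIFO ordering are assumptions, and a real implementation would run on the accelerated processing units themselves.

```python
from collections import defaultdict, deque


class Mailboxes:
    """One memory segment (modeled as a queue) per receiving work group."""

    def __init__(self):
        self.segments = defaultdict(deque)

    def send(self, sender_group, receiver_group, payload):
        # Store the message in the segment allocated to the receiving group.
        self.segments[receiver_group].append((sender_group, payload))

    def receive(self, group):
        # Return the oldest pending message, or None if the segment is empty.
        segment = self.segments[group]
        return segment.popleft() if segment else None


boxes = Mailboxes()
boxes.send(sender_group=0, receiver_group=1, payload="partial sum")
received = boxes.receive(1)
```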
Abstract:
A processor employs a memory tree and a code generation and scheduling framework (CGSF) to generate instructions to access data at memory modules associated with the processor. The memory tree is a data structure having a plurality of nodes, with each node corresponding to a different memory module, memory cluster, or other portion of memory. The CGSF employs the memory tree to expose the memory hierarchy of the processor to a computer programmer. The computer programmer can employ compiler directives to identify nodes of the memory tree and to establish data ordering and manipulation formats for each node. Based on the directives and the memory tree, the CGSF generates schedules of instructions that, when executed at the processor, enforce the data ordering and manipulation formats.
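A memory tree of the kind described above can be sketched as a nested node structure in which each node carries a data format. The node names, the `layout` field, and the `find` helper are hypothetical; the abstract does not define the CGSF's actual data structures or directive syntax.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    name: str
    layout: str = "row_major"  # hypothetical per-node data ordering
    children: list = field(default_factory=list)


def find(node, name):
    """Depth-first search for the node with the given name."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find(child, name)
        if hit is not None:
            return hit
    return None


# A small hierarchy: one root, one cluster, two memory modules.
tree = MemoryNode("root", children=[
    MemoryNode("cluster0", children=[
        MemoryNode("module0"),
        MemoryNode("module1"),
    ]),
])

# A directive naming a node might set its data ordering like this.
find(tree, "module1").layout = "column_major"
```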
Abstract:
A method and a system for block scheduling are disclosed. The method includes retrieving an original block ID, determining a corresponding new block ID from a mapping, executing a new block corresponding to the new block ID, and repeating the retrieving, determining, and executing for each original block ID. The system includes a program memory configured to store multi-block computer programs, an identifier memory configured to store block identifiers (IDs), management hardware configured to retrieve an original block ID from the program memory, scheduling hardware configured to receive the original block ID from the management hardware and determine a new block ID corresponding to the original block ID using a stored mapping, and processing hardware configured to receive the new block ID from the scheduling hardware and execute a new block corresponding to the new block ID.
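The retrieve, map, and execute loop of the method can be sketched in a few lines. Blocks are modeled here as Python callables and the mapping as a dictionary; both choices, and the function name `run_remapped`, are assumptions made for illustration.

```python
def run_remapped(original_ids, mapping, blocks):
    """For each original block ID, look up the new ID and execute that block."""
    results = []
    for original_id in original_ids:
        new_id = mapping[original_id]      # determine the new block ID
        results.append(blocks[new_id]())   # execute the corresponding block
    return results


# Three toy blocks that simply report their own identity.
blocks = {
    0: lambda: "block0",
    1: lambda: "block1",
    2: lambda: "block2",
}

# Hypothetical stored mapping from original block ID to new block ID.
mapping = {0: 2, 1: 0, 2: 1}

order = run_remapped([0, 1, 2], mapping, blocks)
```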
Abstract:
A processor performs vertex coloring for a graph based at least in part on the degree of each vertex of the graph and at least in part on another coloring approach, such as comparison of random values assigned to the vertices. For each vertex in the graph, the processor determines whether the degree of the vertex is a local maximum; that is, whether the degree of the vertex is greater than the degree of each of its connected vertices. Each vertex having a local-maximum degree is assigned a specified or randomly selected color, and is then omitted from future iterations of the coloring process. After a stop criterion is met, the processor assigns random values to the remaining uncolored vertices and assigns colors based on comparisons of the random values.
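The two-phase coloring described above can be sketched sequentially as follows. Several details are assumptions not fixed by the abstract: the stop criterion is a fixed number of degree rounds, degrees are counted over uncolored neighbors only, each round uses one fresh color, and random ties are broken by vertex ID.

```python
import random


def color_graph(adj, max_degree_rounds=2, seed=0):
    """Color a graph given as {vertex: [neighbors]}; returns {vertex: color}."""
    rng = random.Random(seed)
    color = {}
    next_color = 0

    def live_neighbors(v):
        return [u for u in adj[v] if u not in color]

    # Phase 1: color vertices whose degree is a strict local maximum.
    for _ in range(max_degree_rounds):
        remaining = [v for v in adj if v not in color]
        degree = {v: len(live_neighbors(v)) for v in remaining}
        winners = [v for v in remaining
                   if all(degree[v] > degree[u] for u in live_neighbors(v))]
        if not winners:
            break
        for v in winners:  # strict local maxima are never adjacent
            color[v] = next_color
        next_color += 1

    # Phase 2: compare random values among the remaining uncolored vertices.
    while len(color) < len(adj):
        remaining = [v for v in adj if v not in color]
        value = {v: (rng.random(), v) for v in remaining}
        winners = [v for v in remaining
                   if all(value[v] > value[u] for u in live_neighbors(v))]
        for v in winners:
            color[v] = next_color
        next_color += 1
    return color


graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
coloring = color_graph(graph)
```

Because every round assigns a single new color to a set of mutually non-adjacent vertices, the result is always a valid coloring, though not necessarily one with the fewest colors.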
Abstract:
The described embodiments include a computing device with two or more types of processors and a memory that is shared between the two or more types of processors. The computing device performs operations for handling cache coherency between the two or more types of processors. During operation, the computing device sets a cache coherency indicator in metadata in a page table entry in a page table, the page table entry containing information about a page of data that is stored in the memory. The computing device then uses the cache coherency indicator to determine operations to be performed when accessing data in the page of data in the memory. For example, the computing device can use the coherency indicator to determine whether a coherency operation is to be performed when a processor of a given type accesses data in the page of data in the memory.
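The per-page indicator lookup can be sketched as follows. The encoding of the indicator as a set of processor types, the field names, and the `needs_coherency_op` helper are assumptions for this example; the abstract does not specify how the metadata is laid out.

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    physical_frame: int
    # Hypothetical indicator: processor types whose accesses to this page
    # require a coherency operation.
    coherent_for: frozenset = frozenset()


def needs_coherency_op(page_table, virtual_page, processor_type):
    """Consult the indicator in the page table entry for this access."""
    return processor_type in page_table[virtual_page].coherent_for


page_table = {
    0x10: PageTableEntry(physical_frame=0x200, coherent_for=frozenset({"gpu"})),
    0x11: PageTableEntry(physical_frame=0x201),
}

gpu_check = needs_coherency_op(page_table, 0x10, "gpu")
cpu_check = needs_coherency_op(page_table, 0x10, "cpu")
```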