Abstract:
A system and method for efficient management of network traffic in highly data-parallel computing. A processing node includes one or more processors capable of generating network messages. A network interface is used to receive and send network messages across a network. The processing node reduces the number of the original network messages, their storage size, or both, by combining them into one or more new network messages. The new network messages are sent to the network interface for transmission across the network.
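A minimal sketch of the reduction step described above, assuming messages are (destination, payload) pairs and that merging messages bound for the same destination and compressing the combined payload are acceptable reduction strategies; the function name and data layout are illustrative, not taken from the abstract.

    import zlib
    from collections import defaultdict

    def coalesce_messages(messages):
        """Reduce both the number and the storage size of messages.

        `messages` is a list of (destination, payload: bytes) tuples.
        Messages sharing a destination are merged into one new message,
        and the combined payload is compressed.
        """
        by_dest = defaultdict(list)
        for dest, payload in messages:
            by_dest[dest].append(payload)
        return [(dest, zlib.compress(b"".join(payloads)))
                for dest, payloads in by_dest.items()]

The processing node would then hand the new, smaller message list to its network interface for transmission.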
Abstract:
A processor performs vertex coloring for a graph based at least in part on the degree of each vertex of the graph and at least in part on another coloring approach, such as comparison of random values assigned to the vertices. For each vertex in the graph, the processor determines whether the degree of the vertex is a local maximum; that is, whether the degree of the vertex is greater than the degree of each of its connected vertices. Each vertex having a local-maximum degree is assigned a specified or randomly selected color, and is then omitted from future iterations of the coloring process. After a stop criterion is met, the processor assigns random values to the remaining uncolored vertices and assigns colors based on comparisons of the random values.
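A minimal sketch of the two-phase scheme, assuming a Jones-Plassmann-style rule for the random-value phase and "no qualifying vertex remains" as the stop criterion; both assumptions go beyond the abstract.

    import random

    def color_graph(adj):
        """Color a graph given as {vertex: set_of_neighbors}."""
        color, uncolored = {}, set(adj)

        # Phase 1: color vertices whose degree (within the uncolored
        # subgraph) is a strict local maximum, then omit them from
        # later iterations. Stop when no vertex qualifies (assumption).
        while True:
            degree = {v: sum(n in uncolored for n in adj[v]) for v in uncolored}
            local_max = [v for v in uncolored
                         if all(degree[v] > degree[n]
                                for n in adj[v] if n in uncolored)]
            if not local_max:
                break
            for v in local_max:
                used = {color[n] for n in adj[v] if n in color}
                color[v] = min(c for c in range(len(adj)) if c not in used)
                uncolored.discard(v)

        # Phase 2: assign random values; a vertex whose value exceeds
        # those of all its uncolored neighbors is colored next.
        rank = {v: random.random() for v in uncolored}  # ties negligible
        while uncolored:
            for v in [v for v in uncolored
                      if all(rank[v] > rank[n]
                             for n in adj[v] if n in uncolored)]:
                used = {color[n] for n in adj[v] if n in color}
                color[v] = min(c for c in range(len(adj)) if c not in used)
                uncolored.discard(v)
        return color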
Abstract:
Described herein is an apparatus and method for remote scoped synchronization, a new semantic that allows a work-item to order memory accesses with a scope instance outside of its scope hierarchy. More precisely, remote synchronization expands visibility at a particular scope to all scope instances encompassed by that scope. The remote scoped synchronization operation allows smaller scopes to be used more frequently and defers the added cost to the occasions when larger-scoped synchronization is actually required. This enables programmers to optimize the scope at which memory operations are performed for important communication patterns such as work stealing. Executing memory operations at the optimal scope reduces both execution time and energy. In particular, remote synchronization allows a work-item to communicate with a scope that it otherwise would not be able to access: work-items can pull valid data from, and push updates to, scopes that do not (hierarchically) contain them.
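The semantic itself is a memory-model feature; as a rough illustration only, here is a toy Python model of the visibility idea, with every class and function name invented and no claim to match the actual hardware mechanism.

    class ScopeInstance:
        """Toy scope instance: a private cache synchronized with global
        memory only on release/acquire operations."""
        def __init__(self, name):
            self.name = name
            self.cache = {}          # data visible only inside this scope

    def remote_pull(remote_scope, memory):
        """Remote acquire (sketch): a work-item outside `remote_scope`
        forces that scope's updates out to memory so it can read them."""
        memory.update(remote_scope.cache)
        remote_scope.cache.clear()

    def remote_push(local_scope, remote_scope, memory):
        """Remote release (sketch): publish the caller's updates and
        invalidate the remote scope so its work-items re-read memory."""
        memory.update(local_scope.cache)
        local_scope.cache.clear()
        remote_scope.cache.clear()

    # Work stealing: a work-item in work-group wg1 steals a task that a
    # work-item in wg0 published only with a wg0-scoped release.
    memory = {}
    wg0, wg1 = ScopeInstance("wg0"), ScopeInstance("wg1")
    wg0.cache["task"] = "payload"        # victim's wg0-scoped store
    remote_pull(wg0, memory)             # thief's remote scoped acquire
    assert memory["task"] == "payload"   # thief observes valid data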
Abstract:
Central processing units (CPUs) in computing systems manage graphics processing units (GPUs), network processors, security co-processors, and other data-heavy devices as buffered peripherals using device drivers. Unfortunately, because data transfers between CPUs and these external devices are large and latency-sensitive, and because memory is partitioned into kernel-access and user-access spaces, these schemes for managing peripherals may introduce latency and memory-use inefficiencies. Proposed are schemes that reduce latency and redundant memory copies by using virtual-to-physical page remapping while maintaining user/kernel-level access abstractions.
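A toy model of the remapping idea, with invented names and a dictionary standing in for physical frames; a real driver would manipulate hardware page tables, not Python dicts.

    PAGE_SIZE = 4096

    class AddressSpace:
        """Toy page table: virtual page number -> physical frame number."""
        def __init__(self):
            self.page_table = {}

    frames = {0: bytearray(b"user data".ljust(PAGE_SIZE, b"\0"))}
    user_as, kernel_as = AddressSpace(), AddressSpace()
    user_as.page_table[0x10] = 0

    def copy_transfer(vpn):
        """Conventional driver path: copy the user page into a fresh
        kernel-space frame before the device may use it."""
        new_frame = max(frames) + 1
        frames[new_frame] = frames[user_as.page_table[vpn]][:]  # extra copy
        kernel_as.page_table[vpn] = new_frame

    def remap_transfer(vpn):
        """Proposed path (sketch): alias the same physical frame in the
        kernel/device-visible space; no bytes move, and the separate
        page tables preserve the user/kernel access abstraction."""
        kernel_as.page_table[vpn] = user_as.page_table[vpn]

    remap_transfer(0x10)
    assert frames[kernel_as.page_table[0x10]] is frames[user_as.page_table[0x10]]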
Abstract:
Systems, methods, and devices for deploying an artificial neural network (ANN). Candidate ANNs are generated for performing an inference task based on specifications of a target inference device. Trained ANNs are generated by training the candidate ANNs to perform the inference task on an inference device conforming to the specifications. Characteristics describing each trained ANN's performance of the inference task on a device conforming to the specifications are determined. Profiles that reflect the characteristics of each trained ANN are stored. The stored profiles are queried based on requirements of an application to select an ANN from among the trained ANNs. The selected ANN is deployed on an inference device conforming to the target inference device specifications. Input data is communicated to the deployed ANN from the application. An output is generated using the deployed ANN, and the output is communicated to the application.
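A minimal sketch of the profile store and query step, assuming latency, accuracy, and memory footprint as the recorded characteristics and "most accurate eligible network" as the selection rule; all of these specifics are assumptions, not taken from the abstract.

    from dataclasses import dataclass

    @dataclass
    class Profile:
        """Stored characteristics of one trained candidate ANN."""
        name: str
        latency_ms: float
        accuracy: float
        memory_mb: float

    def select_ann(profiles, max_latency_ms, min_accuracy, max_memory_mb):
        """Query stored profiles against application requirements and
        pick the most accurate ANN that satisfies every constraint."""
        eligible = [p for p in profiles
                    if p.latency_ms <= max_latency_ms
                    and p.accuracy >= min_accuracy
                    and p.memory_mb <= max_memory_mb]
        if not eligible:
            raise LookupError("no trained ANN satisfies the requirements")
        return max(eligible, key=lambda p: p.accuracy)

    profiles = [Profile("small", 2.1, 0.91, 40),
                Profile("large", 9.8, 0.97, 310)]
    chosen = select_ann(profiles, max_latency_ms=5.0,
                        min_accuracy=0.90, max_memory_mb=128)
    # `chosen` ("small" here) would then be deployed on a device that
    # conforms to the target inference device specifications.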
Abstract:
Techniques for managing message transmission in a large networked computer system that includes multiple individual networked computing systems are disclosed. Message passing among the computing systems includes a sending computing device transmitting a message to a receiver computing device and the receiver computing device consuming that message. A build-up of data stored in a buffer at the receiver can reduce performance. In order to reduce the potential performance degradation associated with large amounts of “waiting” data in the buffer, a sending computer system first determines whether the receiver computer system is ready to receive a message and does not transmit the message if the receiver computer system is not ready. To determine whether the receiver computer system is ready to receive a message, the receiver computer system, at the request of the sending computer system, checks a counting filter that stores indications of whether particular messages are ready.
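A sketch of the readiness check, assuming the counting filter is a counting Bloom filter (the abstract does not specify its construction); names and parameters are illustrative.

    import hashlib

    class CountingFilter:
        """Counting Bloom filter of 'ready for message X' indications."""
        def __init__(self, size=1024, hashes=3):
            self.counts = [0] * size
            self.size, self.hashes = size, hashes

        def _slots(self, msg_id):
            for i in range(self.hashes):
                digest = hashlib.sha256(f"{i}:{msg_id}".encode()).digest()
                yield int.from_bytes(digest[:4], "big") % self.size

        def mark_ready(self, msg_id):        # receiver: ready to consume
            for s in self._slots(msg_id):
                self.counts[s] += 1

        def consume(self, msg_id):           # receiver: message consumed
            for s in self._slots(msg_id):
                self.counts[s] -= 1

        def is_ready(self, msg_id):
            # May yield false positives, never false negatives.
            return all(self.counts[s] > 0 for s in self._slots(msg_id))

    def try_send(receiver_filter, msg_id, transmit):
        """Sender side: check readiness first; withhold the message
        rather than let it pile up in the receiver's buffer."""
        if receiver_filter.is_ready(msg_id):
            transmit(msg_id)
            return True
        return False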
Abstract:
Systems, methods, and devices for determining a derived counter value based on a hardware performance counter. Example devices include input circuitry configured to input a hardware performance counter value; counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value; and output circuitry configured to communicate the derived counter value to a consumer. In some examples, the consumer includes an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller. In some examples, the device includes circuitry configured to dynamically change the model during operation. In some examples, the model includes or is generated by an artificial neural network (ANN).
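A minimal sketch of the counter engine, using an invented linear model as a stand-in for whatever model (possibly ANN-generated) the device would actually apply.

    class CounterEngine:
        """Derives a software-visible metric from a raw hardware
        performance counter by applying a model."""
        def __init__(self, model):
            self.model = model

        def set_model(self, model):
            # Dynamic model change during operation.
            self.model = model

        def derive(self, hw_counter_value):
            return self.model(hw_counter_value)

    def linear_model(x):                    # invented example model
        return 0.8 * x + 12.0

    engine = CounterEngine(linear_model)
    value = engine.derive(1000)             # -> 812.0
    # `value` would be communicated to a consumer such as an OS
    # scheduler, memory controller, power manager, data prefetcher,
    # or cache controller.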
Abstract:
A method and apparatus for transmitting data includes determining, based upon a first criterion, whether to apply a mask to a cache line that includes a first type of data and a second type of data for transmission. The second type of data is filtered from the cache line, and the first type of data, along with an identifier of the applied mask, is transmitted. The first type of data and the identifier are received, and the second type of data is combined with the first type of data, based upon the received identifier, to recreate the cache line.
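An illustrative sketch, assuming byte-granularity masks and that the receiver can re-supply the second type of data from a known fill pattern; both assumptions go beyond the abstract.

    CACHE_LINE = 64

    # Invented mask table: each identifier maps to the byte offsets
    # that hold the first type of data within a 64-byte cache line.
    MASKS = {1: [i for i in range(CACHE_LINE) if i % 8 < 4]}

    def filter_line(line, mask_id):
        """Sender: drop second-type bytes; send the rest plus mask id."""
        return bytes(line[i] for i in MASKS[mask_id]), mask_id

    def recreate_line(first_type, mask_id, fill=0):
        """Receiver: recombine second-type data (here, an assumed known
        fill pattern) with the received bytes via the mask id."""
        line = bytearray([fill] * CACHE_LINE)
        for byte, offset in zip(first_type, MASKS[mask_id]):
            line[offset] = byte
        return bytes(line)

    payload, mid = filter_line(bytes(range(CACHE_LINE)), 1)  # 32 bytes sent
    restored = recreate_line(payload, mid)                   # 64 bytes again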
Abstract:
A receiving node in a computer system that includes a plurality of types of execution units receives an active message from a sending node. If the message handler for the active message is not available at the receiver, the sender sends an intermediate language version of the message handler to the receiving node. The receiving node compiles the intermediate language message handler into a machine instruction set architecture (ISA) message handler and executes the ISA message handler on a selected one of the execution units. The execution unit selected to execute the message handler is chosen based on a field in the active message or on runtime criteria in the receiving system.
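A sketch of the receive path, using Python source and exec as a stand-in for the intermediate language and its ISA compiler, and a least-loaded policy as one possible runtime criterion; all of these stand-ins are assumptions.

    ISA_CACHE = {}   # handler name -> compiled handler (a callable here)

    def compile_il(il_source):
        """Stand-in for the IL-to-ISA compiler: the 'intermediate
        language' is just Python source defining handler()."""
        namespace = {}
        exec(il_source, namespace)
        return namespace["handler"]

    def receive_active_message(msg, execution_units):
        name = msg["handler"]
        if name not in ISA_CACHE:
            # Handler unavailable at the receiver: compile the IL
            # version that the sender shipped with the message.
            ISA_CACHE[name] = compile_il(msg["il_source"])
        # Pick an execution unit from a message field, else fall back
        # to a runtime criterion (least-loaded, as an assumption).
        unit = msg.get("unit") or min(execution_units, key=lambda u: u["load"])
        return ISA_CACHE[name](msg["payload"]), unit["kind"]

    units = [{"kind": "cpu", "load": 3}, {"kind": "gpu", "load": 1}]
    msg = {"handler": "inc", "payload": 41,
           "il_source": "def handler(x):\n    return x + 1"}
    result, chosen = receive_active_message(msg, units)      # (42, 'gpu')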
Abstract:
The described embodiments include a computing device with two or more types of processors and a memory that is shared between the two or more types of processors. The computing device performs operations for handling cache coherency between the two or more types of processors. During operation, the computing device sets a cache coherency indicator in metadata in a page table entry in a page table, the page table entry holding information about a page of data that is stored in the memory. The computing device then uses the cache coherency indicator to determine operations to be performed when accessing data in the page of data in the memory. For example, the computing device can use the coherency indicator to determine whether a coherency operation is to be performed when a processor of a given type accesses data in the page of data in the memory.
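A toy model of the indicator check; the polarity of the bit and the specific coherency operation (a flush) are assumptions made only for illustration.

    from dataclasses import dataclass

    @dataclass
    class PageTableEntry:
        frame: int
        coherent: bool        # cache coherency indicator in PTE metadata

    PAGE_TABLE = {0x10: PageTableEntry(frame=7, coherent=False),
                  0x11: PageTableEntry(frame=8, coherent=True)}

    def flush_cpu_caches_for(frame):          # hypothetical operation
        print(f"coherency operation on frame {frame}")

    def access_page(vpn, processor_type):
        """Consult the PTE's indicator to decide whether a coherency
        operation must precede the access for this processor type."""
        pte = PAGE_TABLE[vpn]
        if processor_type == "gpu" and not pte.coherent:
            flush_cpu_caches_for(pte.frame)
        return pte.frame

    access_page(0x10, "gpu")   # indicator clear: coherency operation runs
    access_page(0x11, "gpu")   # indicator set: access proceeds directly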