MESSAGE AGGREGATION, COMBINING AND COMPRESSION FOR EFFICIENT DATA COMMUNICATIONS IN GPU-BASED CLUSTERS
    21.
    Invention Application - Pending (Published)

    Publication No.: US20160352598A1

    Publication Date: 2016-12-01

    Application No.: US15165953

    Filing Date: 2016-05-26

    CPC classification number: H04L47/365

    Abstract: A system and method for efficient management of network traffic in highly data-parallel computing. A processing node includes one or more processors capable of generating network messages. A network interface receives and sends network messages across a network. The processing node reduces at least one of the number or the storage size of the original network messages by combining them into one or more new network messages. The new network messages are sent to the network interface for transmission across the network.

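    The reduction step described in the abstract can be sketched in a few lines: many small outgoing messages bound for the same destination are aggregated into one network message, and the combined payload is compressed to shrink its storage size. This is an illustrative sketch only; `zlib` stands in for whatever compressor the runtime or hardware would actually use.

```python
import zlib

def aggregate_and_compress(messages):
    """Combine per-destination small messages into one compressed message each.

    `messages` is a list of (destination, payload_bytes) pairs.
    """
    by_dest = {}
    for dest, payload in messages:
        by_dest.setdefault(dest, []).append(payload)
    out = []
    for dest, payloads in by_dest.items():
        combined = b"".join(payloads)                 # aggregation: N messages -> 1
        out.append((dest, zlib.compress(combined)))   # compression: smaller storage size
    return out
```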

    TWO-PHASE HYBRID VERTEX CLASSIFICATION
    22.
    Invention Application - Pending (Published)

    Publication No.: US20160343343A1

    Publication Date: 2016-11-24

    Application No.: US14720293

    Filing Date: 2015-05-22

    Inventor: Shuai Che

    CPC classification number: G09G5/001 G06T1/20 G09G5/04

    Abstract: A processor performs vertex coloring for a graph based at least in part on the degree of each vertex of the graph and at least in part on another coloring approach, such as comparison of random values assigned to the vertices. For each vertex in the graph, the processor determines whether the degree of the vertex is a local maximum, that is, whether its degree is greater than the degree of each of its connected vertices. Each vertex having a local-maximum degree is assigned a specified or randomly selected color and is then omitted from future iterations of the coloring process. After a stop criterion is met, the processor assigns random values to the remaining uncolored vertices and assigns colors based on comparisons of the random values.

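    A minimal sketch of the two-phase scheme just described, assuming a fixed round count as the stop criterion and smallest-free-color assignment (both are illustrative choices the abstract leaves open): phase one colors vertices whose degree is a local maximum among still-uncolored neighbors, and phase two colors the rest by random-value comparison, Jones-Plassmann style.

```python
import random

def two_phase_coloring(adj, max_degree_rounds=2, seed=0):
    """Hypothetical sketch: adj maps each vertex to its neighbor list."""
    rng = random.Random(seed)
    color = {}
    uncolored = set(adj)

    def smallest_free_color(v):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        return c

    # Phase 1: color local-maximum-degree vertices, then drop them.
    for _ in range(max_degree_rounds):
        degree = {v: sum(1 for u in adj[v] if u in uncolored) for v in uncolored}
        winners = [v for v in uncolored
                   if all(degree[v] > degree[u] for u in adj[v] if u in uncolored)]
        for v in winners:
            color[v] = smallest_free_color(v)
            uncolored.discard(v)

    # Phase 2: random-value comparison for the remaining vertices.
    while uncolored:
        value = {v: rng.random() for v in uncolored}
        winners = [v for v in uncolored
                   if all(value[v] > value[u] for u in adj[v] if u in uncolored)]
        for v in winners:
            color[v] = smallest_free_color(v)
            uncolored.discard(v)
    return color
```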

    PROCESSOR AND METHODS FOR REMOTE SCOPED SYNCHRONIZATION
    23.
    Invention Application - Granted

    Publication No.: US20160139624A1

    Publication Date: 2016-05-19

    Application No.: US14542042

    Filing Date: 2014-11-14

    Abstract: Described herein is an apparatus and method for remote scoped synchronization, a new semantic that allows a work-item to order memory accesses with a scope instance outside of its scope hierarchy. More precisely, remote synchronization expands visibility at a particular scope to all scope instances encompassed by that scope. Remote scoped synchronization allows smaller scopes to be used more frequently and defers the added cost until larger-scoped synchronization is required. This enables programmers to optimize the scope at which memory operations are performed for important communication patterns such as work stealing. Executing memory operations at the optimum scope reduces both execution time and energy. In particular, remote synchronization allows a work-item to communicate with a scope that it otherwise would not be able to access. Specifically, work-items can pull valid data from, and push updates to, scopes that do not (hierarchically) contain them.

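    The work-stealing use case can be illustrated with a deliberately simplified toy model, not the patent's hardware semantics: each scope instance (say, a workgroup) holds a shared store, an ordinary sync publishes a work-item's pending writes to its own scope, and a remote sync pushes them to a scope instance that does not contain the work-item. All class and method names are invented for illustration.

```python
class ScopeInstance:
    """Toy scope instance (e.g., a workgroup) with a shared store."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class WorkItem:
    def __init__(self, scope):
        self.scope = scope   # the scope instance that contains this item
        self.pending = {}    # writes not yet visible outside the item

    def write(self, key, val):
        self.pending[key] = val

    def sync(self):
        # Ordinary scoped release: publish to the item's own scope instance.
        self.scope.store.update(self.pending)
        self.pending.clear()

    def remote_sync(self, other_scope):
        # Remote scoped release: push updates to a scope instance that does
        # not (hierarchically) contain this work-item.
        other_scope.store.update(self.pending)
        self.pending.clear()

    def remote_read(self, other_scope, key):
        # Remote acquire: pull valid data from a foreign scope instance.
        return other_scope.store.get(key)
```

    For example, a work-item in workgroup A can push a stolen-task descriptor directly into workgroup B's store without synchronizing at the whole-device scope.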

    Mechanisms to Save User/Kernel Copy for Cross Device Communications
    24.
    Invention Application - Granted

    Publication No.: US20150261457A1

    Publication Date: 2015-09-17

    Application No.: US14213640

    Filing Date: 2014-03-14

    Abstract: Central processing units (CPUs) in computing systems manage graphics processing units (GPUs), network processors, security co-processors, and other data-heavy devices as buffered peripherals using device drivers. Unfortunately, because of large, latency-sensitive data transfers between CPUs and these external devices, and because memory is partitioned into kernel-access and user-access spaces, these schemes for managing peripherals may introduce latency and memory-use inefficiencies. Proposed are schemes that reduce latency and redundant memory copies using virtual-to-physical page remapping while maintaining user/kernel-level access abstractions.

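    A loose user-space analogy of the copy-avoidance idea, under the assumption that it is only an illustration: instead of copying a payload between a "kernel-side" buffer and a "user-side" buffer, both sides map the same pages, so an update through one view is immediately visible through the other with no copy. Real page remapping is done by the OS at the page-table level; `mmap` here merely demonstrates the shared-mapping effect.

```python
import mmap

# One anonymous shared mapping, viewed from two places.
buf = mmap.mmap(-1, 4096)          # anonymous mapping, 4 KiB
kernel_view = memoryview(buf)      # stand-in for the "kernel" view of the pages
user_view = memoryview(buf)        # stand-in for the "user" view of the same pages

kernel_view[0:5] = b"hello"        # the driver side fills the buffer once
assert bytes(user_view[0:5]) == b"hello"  # the user side sees it; no copy was made
```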

    OPTIMIZING INFERENCE FOR DEEP-LEARNING NEURAL NETWORKS IN A HETEROGENEOUS SYSTEM

    Publication No.: US20200005135A1

    Publication Date: 2020-01-02

    Application No.: US16023638

    Filing Date: 2018-06-29

    Inventor: Shuai Che

    Abstract: Systems, methods, and devices for deploying an artificial neural network (ANN). Candidate ANNs are generated for performing an inference task based on specifications of a target inference device. Trained ANNs are generated by training the candidate ANNs to perform the inference task on an inference device conforming to the specifications. Characteristics describing each trained ANN's performance of the inference task on a device conforming to the specifications are determined. Profiles that reflect the characteristics of each trained ANN are stored. The stored profiles are queried based on requirements of an application to select an ANN from among the trained ANNs. The selected ANN is deployed on an inference device conforming to the target inference device specifications. Input data is communicated to the deployed ANN from the application. An output is generated using the deployed ANN, and the output is communicated to the application.
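    The profile-and-query flow above can be sketched as a small table lookup: each trained candidate gets a profile of measured characteristics, and the application selects a network by querying those profiles against its own requirements. Profile fields, candidate names, and the selection rule (most accurate network meeting the constraints) are all invented for illustration.

```python
# Hypothetical stored profiles of trained candidate ANNs.
profiles = [
    {"ann": "candidate_a", "latency_ms": 4.1, "accuracy": 0.92, "mem_mb": 110},
    {"ann": "candidate_b", "latency_ms": 9.7, "accuracy": 0.95, "mem_mb": 310},
    {"ann": "candidate_c", "latency_ms": 2.3, "accuracy": 0.88, "mem_mb": 60},
]

def select_ann(max_latency_ms, min_accuracy):
    """Return the most accurate trained ANN meeting the app's requirements."""
    ok = [p for p in profiles
          if p["latency_ms"] <= max_latency_ms and p["accuracy"] >= min_accuracy]
    return max(ok, key=lambda p: p["accuracy"])["ann"] if ok else None
```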

    Transmission of large messages in computer systems

    Publication No.: US10459776B2

    Publication Date: 2019-10-29

    Application No.: US15614498

    Filing Date: 2017-06-05

    Inventor: Shuai Che

    Abstract: Techniques for managing message transmission in a large networked computer system that includes multiple individual networked computing systems are disclosed. Message passing among the computing systems includes a sending computing device transmitting a message to a receiver computing device and the receiver consuming that message. A build-up of data stored in a buffer at the receiver can reduce performance. In order to reduce the potential performance degradation associated with large amounts of "waiting" data in the buffer, a sending computer system first determines whether the receiver computer system is ready to receive a message and does not transmit the message if the receiver computer system is not ready. To determine whether the receiver computer system is ready to receive a message, the receiver computer system, at the request of the sending computer system, checks a counting filter that stores indications of whether particular messages are ready.
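    The "counting filter" above could plausibly take the form of a counting Bloom filter (the abstract does not pin down the exact structure, so this is an assumption): the receiver adds an indication when it is ready for a message, removes it once the message is consumed, and answers sender-side readiness queries with a membership test. Sizes and hash choices are illustrative.

```python
import hashlib

class CountingFilter:
    """Minimal counting Bloom filter for message-ready indications (sketch)."""
    def __init__(self, size=64, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counts = [0] * size

    def _slots(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key):      # receiver marks: ready for this message
        for s in self._slots(key):
            self.counts[s] += 1

    def remove(self, key):   # message consumed; clear the indication
        for s in self._slots(key):
            self.counts[s] -= 1

    def ready(self, key):    # answers the sender's readiness query
        return all(self.counts[s] > 0 for s in self._slots(key))
```

    The counting (rather than plain bit) variant matters here because indications must be removable as messages are consumed; like any Bloom-style filter, it can report false positives but never false negatives.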

    RECONFIGURABLE PREDICTION ENGINE FOR GENERAL PROCESSOR COUNTING

    Publication No.: US20190286971A1

    Publication Date: 2019-09-19

    Application No.: US15922875

    Filing Date: 2018-03-15

    Abstract: Systems, methods, and devices for determining a derived counter value based on a hardware performance counter. Example devices include input circuitry configured to input a hardware performance counter value; counter engine circuitry configured to determine the derived counter value by applying a model to the hardware performance counter value; and output circuitry configured to communicate the derived counter value to a consumer. In some examples, the consumer includes an operating system scheduler, a memory controller, a power manager, a data prefetcher, or a cache controller. In some examples, the processor includes circuitry configured to dynamically change the model during operation of the processor. In some examples, the model includes or is generated by an artificial neural network (ANN).

    Method and apparatus for masking and transmitting data

    Publication No.: US10042774B2

    Publication Date: 2018-08-07

    Application No.: US15268974

    Filing Date: 2016-09-19

    Abstract: A method and apparatus for transmitting data includes determining, based upon a first criterion, whether to apply a mask to a cache line that includes a first type of data and a second type of data for transmission. The second type of data is filtered from the cache line, and the first type of data is transmitted along with an identifier of the applied mask. The first type of data and the identifier are received, and the second type of data is combined with the first type of data to recreate the cache line based upon the received identifier.
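    A hedged sketch of the masking scheme: suppose a cache line mixes payload ("first type") with a known filler pattern ("second type", here zeros). The sender strips the filler and transmits only the payload plus a small mask identifier; the receiver looks up the mask and re-inserts the filler to rebuild the line. The mask table and filler value are invented for illustration.

```python
# Hypothetical shared mask table: True marks a payload position in the line.
MASKS = {1: [True, True, False, False, True, False, False, False]}
FILLER = 0  # the "second type" of data: a known filler pattern

def transmit(line, mask_id):
    """Strip filler; only the payload and the small mask id cross the link."""
    mask = MASKS[mask_id]
    payload = [b for b, keep in zip(line, mask) if keep]
    return payload, mask_id

def receive(payload, mask_id):
    """Recreate the cache line by re-inserting filler per the mask id."""
    mask = MASKS[mask_id]
    it = iter(payload)
    return [next(it) if keep else FILLER for keep in mask]
```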

    Message handler compiling and scheduling in heterogeneous system architectures

    Publication No.: US10025605B2

    Publication Date: 2018-07-17

    Application No.: US15094615

    Filing Date: 2016-04-08

    Abstract: A receiving node in a computer system that includes a plurality of types of execution units receives an active message from a sending node. The receiving node compiles an intermediate-language message handler corresponding to the active message into a machine instruction set architecture (ISA) message handler, and the receiving node executes the ISA message handler on a selected one of the execution units. If the message handler is not available at the receiver, the sending node sends an intermediate-language version of the handler to the receiving node. The execution unit selected to execute the message handler is chosen based on a field in the active message or on runtime criteria in the receiving system.
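    The receive-side flow above can be sketched as compile-once dispatch: the active message names an intermediate-language handler, the receiver compiles it on first use for the selected execution unit, caches the result, and runs it. Python source stands in for the intermediate language and compiled functions stand in for ISA code; all names and message fields are illustrative assumptions.

```python
# Hypothetical registry of intermediate-language handlers (Python source as IL).
IL_HANDLERS = {
    "accumulate": "def handler(state, payload):\n    return state + payload",
}

compiled_cache = {}  # (handler_name, execution_unit) -> compiled callable

def run_active_message(msg, state):
    """Compile (if needed) and run the handler named by an active message."""
    unit = msg.get("unit", "cpu")        # a field in the message selects the unit
    key = (msg["handler"], unit)
    if key not in compiled_cache:        # compile IL -> "ISA" code on first use
        ns = {}
        exec(IL_HANDLERS[msg["handler"]], ns)
        compiled_cache[key] = ns["handler"]
    return compiled_cache[key](state, msg["payload"])
```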
