1. GRAPH MATCHING FOR OPTIMIZED DEEP NETWORK PROCESSING

    Publication number: US20180314945A1

    Publication date: 2018-11-01

    Application number: US15498943

    Application date: 2017-04-27

    Abstract: Systems, apparatuses, and methods for optimized processing of deep neural networks are disclosed. A system is configured to receive a source code representation of a neural network. In one embodiment, the source code representation is a directed acyclic graph (DAG). The system determines whether the source code representation includes any of one or more patterns, with each pattern comprising two or more adjacent layers. The system also identifies, for each pattern, a combined layer with which to replace the detected pattern. If any occurrences of the one or more patterns are detected in the source code representation, the system replaces each occurrence with the corresponding combined layer. Additionally, the system generates an optimized representation of the neural network, wherein the optimized representation includes replacements for any detected patterns. The optimized representation can be utilized to generate an executable version of the neural network.
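    The pattern-replacement pass described in the abstract can be sketched briefly. This is an illustrative Python sketch, not the patented implementation; the layer names and the `FUSE_PATTERNS` table are assumptions, and a real pass would walk a DAG rather than a flat layer list:

    ```python
    # Illustrative two-layer fusion table; real patterns and combined layers
    # would come from the target backend.
    FUSE_PATTERNS = {
        ("conv", "batchnorm"): "conv_bn",   # fold batch norm into the conv
        ("conv", "relu"): "conv_relu",      # fuse activation into the conv
    }

    def fuse_layers(layers):
        """Replace each detected adjacent-layer pattern with its combined layer."""
        out = []
        i = 0
        while i < len(layers):
            pair = tuple(layers[i:i + 2])
            if pair in FUSE_PATTERNS:
                out.append(FUSE_PATTERNS[pair])
                i += 2                       # skip both layers of the fused pair
            else:
                out.append(layers[i])
                i += 1
        return out
    ```

    On a network such as `["conv", "batchnorm", "relu"]`, the pass emits the combined `conv_bn` layer followed by the untouched `relu`.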

    2. WORKLOAD PARTITIONING AMONG HETEROGENEOUS PROCESSING NODES
    Invention application (in force)

    Publication number: US20140359126A1

    Publication date: 2014-12-04

    Application number: US13908887

    Application date: 2013-06-03

    CPC classification number: H04L47/70 G06F9/5044 Y02D10/22

    Abstract: A method of computing is performed in a first processing node of a plurality of processing nodes of multiple types with distinct processing capabilities. The method includes, in response to a command, partitioning data associated with the command among the plurality of processing nodes. The data is partitioned based at least in part on the distinct processing capabilities of the multiple types of processing nodes.
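    Capability-proportional partitioning of this kind can be sketched in a few lines. The following is a minimal illustration, assuming each node's capability is summarized as a single relative score (the scoring itself is outside the abstract's scope):

    ```python
    def partition(data, capabilities):
        """Split data among nodes proportionally to each node's capability score."""
        total = sum(capabilities)
        # Integer share per node, proportional to its capability.
        shares = [len(data) * c // total for c in capabilities]
        shares[0] += len(data) - sum(shares)    # give the rounding remainder to node 0
        parts, start = [], 0
        for n in shares:
            parts.append(data[start:start + n])
            start += n
        return parts
    ```

    A node rated 3x as capable as its peer thus receives roughly 3x the data, so both finish their shares in comparable time.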


    3. BENCHMARK GENERATION USING INSTRUCTION EXECUTION INFORMATION
    Invention application (pending, published)

    Publication number: US20140258688A1

    Publication date: 2014-09-11

    Application number: US13789233

    Application date: 2013-03-07

    CPC classification number: G06F11/3428 G06F11/3466 G06F11/348

    Abstract: Methods and systems are provided for generating a benchmark representative of a reference process. One method involves obtaining execution information for a subset of the plurality of instructions of the reference process from a pipeline of a processing module during execution of those instructions by the processing module, determining performance characteristics quantifying the execution behavior of the reference process based on the execution information, and generating the benchmark process that mimics the quantified execution behavior of the reference process based on the performance characteristics.
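    The characterize-then-mimic flow can be illustrated with one simple performance characteristic, the instruction mix. This sketch is an assumption-laden simplification: real pipeline traces carry far richer records than the `kind` field used here, and a real benchmark generator emits executable code rather than a token stream:

    ```python
    from collections import Counter

    def characterize(trace):
        """Derive a performance characteristic (instruction-mix ratios) from
        per-instruction execution records sampled from the pipeline."""
        mix = Counter(rec["kind"] for rec in trace)
        total = sum(mix.values())
        return {kind: count / total for kind, count in mix.items()}

    def generate_benchmark(profile, length):
        """Emit a synthetic instruction stream that mimics the measured mix."""
        stream = []
        for kind, ratio in sorted(profile.items()):
            stream += [kind] * round(ratio * length)
        return stream
    ```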


    4. Programming in-memory accelerators to improve the efficiency of datacenter operations

    Publication number: US10198349B2

    Publication date: 2019-02-05

    Application number: US15269495

    Application date: 2016-09-19

    Abstract: Systems, apparatuses, and methods for utilizing in-memory accelerators to perform data conversion operations are disclosed. A system includes one or more main processors coupled to one or more memory modules. Each memory module includes one or more memory devices coupled to a processing in memory (PIM) device. The main processors are configured to generate an executable for a PIM device to accelerate data conversion tasks of data stored in the local memory devices. In one embodiment, the system detects a read request for data stored in a given memory module. In order to process the read request, the system determines that a conversion from a first format to a second format is required. In response to detecting the read request, the given memory module's PIM device performs the conversion of the data from the first format to the second format and then provides the data to a consumer application.
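    The read path described in the abstract reduces to: look up the stored data and its format, convert near-memory if the consumer expects a different format, and return the result. The sketch below simulates that control flow in Python; the `CONVERTERS` table and the dict-based memory module are illustrative stand-ins for a PIM kernel running inside the module:

    ```python
    # Illustrative converter table; in the described system the conversion
    # kernel would execute on the PIM device inside the memory module.
    CONVERTERS = {
        ("csv", "ints"): lambda s: [int(x) for x in s.split(",")],
    }

    def pim_read(memory_module, address, consumer_format):
        """Serve a read request, converting formats near-memory when the
        stored format differs from what the consumer application expects."""
        data, stored_format = memory_module[address]
        if stored_format != consumer_format:
            data = CONVERTERS[(stored_format, consumer_format)](data)
        return data
    ```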

    5. Power aware work stealing
    Invention grant

    Publication number: US10089155B2

    Publication date: 2018-10-02

    Application number: US14862038

    Application date: 2015-09-22

    Abstract: First and second processor cores are configured to concurrently execute tasks. A scheduler is configured to schedule tasks for execution by the first and second processor cores. The first processor core is configured to selectively steal a task that was previously scheduled for execution by the second processor core based on additional power consumption incurred by migrating the task from the second processor core to the first processor core.
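    The steal decision can be sketched as a guard on an ordinary work-stealing pop. This is a simplified illustrative model: the `migration_cost` function and the power-budget framing are assumptions standing in for whatever power estimate the scheduler actually uses:

    ```python
    from collections import deque

    def try_steal(thief_queue, victim_queue, migration_cost, power_budget):
        """Steal the victim's oldest task only if the additional power
        incurred by migrating it fits within the thief's power budget."""
        if victim_queue and migration_cost(victim_queue[0]) <= power_budget:
            thief_queue.append(victim_queue.popleft())
            return True
        return False
    ```

    An idle core thus declines a steal whose migration (e.g. moving a large working set between caches) would cost more power than the steal saves.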

    6. Page migration acceleration using a two-level bloom filter on high bandwidth memory systems

    Publication number: US10067709B2

    Publication date: 2018-09-04

    Application number: US15269289

    Application date: 2016-09-19

    Abstract: Systems, apparatuses, and methods for accelerating page migration using a two-level bloom filter are disclosed. In one embodiment, a system includes a GPU and a CPU and a multi-level memory hierarchy. When a memory request misses in a first memory, the GPU is configured to check a first level of a two-level bloom filter to determine if a page targeted by the memory request is located in a second memory. If the first level of the two-level bloom filter indicates that the page is not in the second memory, then the GPU generates a page fault and sends the memory request to a third memory. If the first level of the two-level bloom filter indicates that the page is in the second memory, then the GPU sends the memory request to the CPU.
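    The routing logic exploits the one-sided guarantee of a bloom filter: a miss means the page is definitely not in the second memory, while a hit only means it might be. The sketch below pairs a minimal bloom filter with that routing rule; the filter sizes and the string routing targets are illustrative:

    ```python
    class Bloom:
        """Minimal bloom filter: k hashed slots set in an m-bit integer."""
        def __init__(self, m=1024, k=3):
            self.m, self.k, self.bits = m, k, 0

        def _slots(self, item):
            return [hash((i, item)) % self.m for i in range(self.k)]

        def add(self, item):
            for s in self._slots(item):
                self.bits |= 1 << s

        def maybe_contains(self, item):
            # False => definitely absent; True => possibly present.
            return all(self.bits >> s & 1 for s in self._slots(item))

    def route_request(page, level1):
        """Mirror of the routing above: a level-1 miss faults to the third
        memory; a level-1 hit forwards the request to the CPU."""
        return "cpu" if level1.maybe_contains(page) else "third_memory"
    ```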

    7. Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
    Invention grant (in force)

    Publication number: US09152601B2

    Publication date: 2015-10-06

    Application number: US13890828

    Application date: 2013-05-09

    Abstract: An approach and a method are described for efficient execution of nested map-reduce framework workloads that take advantage of the combined execution of central processing units (CPUs) and graphics processing units (GPUs) and the lower latency of data access in accelerated processing units (APUs). In embodiments, metrics are generated to determine whether a map or reduce function is more efficiently processed on a CPU or a GPU. A first metric is based on the ratio of the number of branch instructions to the number of non-branch instructions, and a second metric is based on a comparison of execution times on each of the CPU and the GPU. Selecting execution of map and reduce functions based on the first and second metrics results in accelerated computations. Some embodiments include scheduling pipelined executions of functions on the CPU and on the GPU concurrently to achieve power-efficient nested map-reduce framework execution.
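    The two-metric device choice can be sketched directly. The threshold value below is an assumption for illustration; the abstract does not state how the branch-ratio cutoff is chosen:

    ```python
    def pick_device(branch_count, nonbranch_count, cpu_time, gpu_time,
                    branch_ratio_threshold=0.2):
        """Choose CPU or GPU for a map/reduce function using the two metrics
        described above: branch ratio first, measured execution time second."""
        if branch_count / nonbranch_count > branch_ratio_threshold:
            return "cpu"    # branch-heavy code diverges badly on a GPU
        return "cpu" if cpu_time < gpu_time else "gpu"
    ```

    Branch-heavy functions go to the CPU outright; otherwise the measured execution times break the tie.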


    8. Processing device with independently activatable working memory bank and methods
    Invention grant (in force)

    Publication number: US08935472B2

    Publication date: 2015-01-13

    Application number: US13723294

    Application date: 2012-12-21

    CPC classification number: G06F12/0891 G06F12/0804 G06F2212/601 Y02D10/13

    Abstract: A data processing device is provided that includes an array of working memory banks and an associated processing engine. The working memory bank array is configured with at least one independently activatable memory bank. A dirty data counter (DDC) is associated with the independently activatable memory bank and is configured to reflect a count of dirty data migrated from the independently activatable memory bank upon selective deactivation of the independently activatable memory bank. The DDC is configured to selectively decrement the count of dirty data upon the reactivation of the independently activatable memory bank in connection with a transient state. In the transient state, each dirty data access by the processing engine to the reactivated memory bank is also conducted with respect to another memory bank of the array. Upon a condition that dirty data is found in the other memory bank, the count of dirty data is decremented.
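    The dirty data counter's lifecycle (count up on deactivation, count down as migrated dirty lines are found during the transient state) can be modeled in a few lines. This is a behavioral sketch only; the hardware events are reduced to method calls with illustrative names:

    ```python
    class DirtyDataCounter:
        """Tracks dirty lines migrated out of a bank at deactivation.
        While the count is nonzero (the transient state), dirty accesses to
        the reactivated bank also check the other bank; each dirty line
        found there decrements the count."""
        def __init__(self):
            self.count = 0

        def on_deactivate(self, migrated_dirty_lines):
            self.count += migrated_dirty_lines

        def on_dirty_hit_in_other_bank(self):
            if self.count:
                self.count -= 1

        @property
        def transient(self):
            return self.count > 0
    ```

    Once the count reaches zero, no migrated dirty data remains elsewhere and the double-checking of the other bank can stop.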


    9. POWER-EFFICIENT NESTED MAP-REDUCE EXECUTION ON A CLOUD OF HETEROGENEOUS ACCELERATED PROCESSING UNITS
    Invention application (in force)

    Publication number: US20140333638A1

    Publication date: 2014-11-13

    Application number: US13890828

    Application date: 2013-05-09

    Abstract: An approach and a method are described for efficient execution of nested map-reduce framework workloads that take advantage of the combined execution of central processing units (CPUs) and graphics processing units (GPUs) and the lower latency of data access in accelerated processing units (APUs). In embodiments, metrics are generated to determine whether a map or reduce function is more efficiently processed on a CPU or a GPU. A first metric is based on the ratio of the number of branch instructions to the number of non-branch instructions, and a second metric is based on a comparison of execution times on each of the CPU and the GPU. Selecting execution of map and reduce functions based on the first and second metrics results in accelerated computations. Some embodiments include scheduling pipelined executions of functions on the CPU and on the GPU concurrently to achieve power-efficient nested map-reduce framework execution.


    10. SYSTEM AND METHOD FOR PROCESSING DATA IN A COMPUTING SYSTEM

    Publication number: US20170371665A1

    Publication date: 2017-12-28

    Application number: US15191257

    Application date: 2016-06-23

    Abstract: Systems, apparatuses, and methods for adjusting group sizes to match a processor lane width are described. In early iterations of an algorithm, a processor partitions a dataset into groups of data points which are integer multiples of the processing lane width of the processor. For example, when performing a K-means clustering algorithm, the processor determines that a first plurality of data points belong to a first group during a given iteration. If the first plurality of data points is not an integer multiple of the number of processing lanes, then the processor reassigns a first number of data points from the first plurality of data points to one or more other groups. The processor then performs the next iteration with these first number of data points assigned to other groups even though the first number of data points actually meets the algorithmic criteria for belonging to the first group.
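    The core reassignment step (shrink a group to an integer multiple of the lane width and hand the overflow to other groups) can be sketched as follows. This is an illustrative helper, not the patented method; how the overflow points are assigned to which other groups is left out:

    ```python
    def trim_to_lane_multiple(group, lane_width):
        """Keep only an integer multiple of lane_width points in the group;
        return the trimmed group plus the overflow points to reassign."""
        keep = (len(group) // lane_width) * lane_width
        return group[:keep], group[keep:]
    ```

    With a 4-wide processor, a cluster of 10 points is trimmed to 8 so every SIMD pass over the group runs with full lanes, and the 2 overflow points are temporarily assigned elsewhere.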
