METHOD FOR MATRIX DATA BROADCAST IN PARALLEL PROCESSING

    Publication No.: US20210191758A1

    Publication date: 2021-06-24

    Application No.: US16723016

    Filing date: 2019-12-20

    Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcast.
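    Illustration (not from the patent text): the single-request idea in the abstract above can be sketched in a few lines. The abstract only says the request carries "information to identify the multiple compute units"; encoding that information as a bit mask, along with the names `BroadcastRequest`, `build_requests`, and the addresses used here, is an assumption made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class BroadcastRequest:
    """One read request plus the set of compute units that receive the data."""
    address: int  # hypothetical address of the shared data
    cu_mask: int  # assumed encoding: bit i set => compute unit i is a receiver


def build_requests(reads):
    """Merge per-compute-unit reads of the same address into single requests.

    `reads` is an iterable of (compute_unit_id, address) pairs.
    """
    masks = defaultdict(int)
    for cu_id, address in reads:
        masks[address] |= 1 << cu_id
    return [BroadcastRequest(addr, mask) for addr, mask in sorted(masks.items())]


# Four compute units each produce one output tile of C = A x B.
# Units 0 and 1 both read row block A0; units 2 and 3 both read row block A1.
A0, A1 = 0x1000, 0x2000
reads = [(0, A0), (1, A0), (2, A1), (3, A1)]

requests = build_requests(reads)
for request in requests:
    print(hex(request.address), bin(request.cu_mask))
```

    In this toy case four separate reads collapse into two requests, each naming two receiving compute units.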

    AUTOMATED PERFORMANCE VERIFICATION FOR INTEGRATED CIRCUIT DESIGN

    Invention application (status: under examination, published)

    Publication No.: US20140181768A1

    Publication date: 2014-06-26

    Application No.: US13723279

    Filing date: 2012-12-21

    CPC classification number: G06F17/5081

    Abstract: A method and apparatus for automated performance verification for integrated circuit design is described herein. The method includes test preparation and automated verification stages. The test preparation stage generates design feature-specific performance tests to meet expected performance goals under certain workloads, using optimization approaches and for different design configurations. The automated verification stage is implemented by integrating functional, automated modules into a verification infrastructure. These modules include register transfer level (RTL) simulation, performance evaluation, and performance publish modules. The RTL simulation module schedules performance testing jobs, runs a series of performance tests on simulation logic simultaneously, and generates performance counters for each functional unit. The performance evaluation module consists of three sub-functions: a functional comparison between actual results and a reference file containing the expected results; performance measurements of throughput, execution time, and latency; and performance analysis. The performance publish module publishes performance results and analysis reports.
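    Illustration (not from the patent text): a minimal sketch of the three evaluation sub-functions named in the abstract above, namely a functional comparison against expected results, measurements of throughput, execution time, and latency, and a simple analysis step. The counter fields, the `evaluate` function, the clock frequency, and the throughput goal are all hypothetical and chosen only for this example; the counters are assumed to be non-zero.

```python
from dataclasses import dataclass


@dataclass
class PerfCounters:
    """Hypothetical counters reported for one functional unit."""
    cycles: int          # total cycles the test ran
    items_done: int      # work items completed
    total_latency: int   # summed per-item latency, in cycles


def evaluate(actual, expected, counters, clock_hz, min_throughput):
    """Return functional and performance results for one test run."""
    throughput = counters.items_done / counters.cycles  # items per cycle
    return {
        # 1. Functional comparison against the reference results.
        "functional_ok": actual == expected,
        # 2. Performance measurements derived from the counters.
        "exec_time_s": counters.cycles / clock_hz,
        "throughput": throughput,
        "avg_latency_cycles": counters.total_latency / counters.items_done,
        # 3. Analysis: compare the measurement with the stated goal.
        "meets_goal": throughput >= min_throughput,
    }


counters = PerfCounters(cycles=1000, items_done=500, total_latency=4000)
report = evaluate(
    actual=[1, 2, 3],
    expected=[1, 2, 3],
    counters=counters,
    clock_hz=1_000_000_000,  # assumed 1 GHz clock
    min_throughput=0.4,      # assumed goal, in items per cycle
)
print(report)
```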

    MATRIX DATA BROADCAST ARCHITECTURE

    Publication No.: US20210191761A1

    Publication date: 2021-06-24

    Application No.: US16729811

    Filing date: 2019-12-30

    Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcast by the processor core.
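    Illustration (not from the patent text): one way to picture the assignment step in the abstract above is to place thread groups that read the same shared block on consecutive compute units, so that a single broadcast can serve them. The abstract does not say how the compiler detects or uses shared data; the placement policy, the function `assign_thread_groups`, and the labels below are assumptions made for this sketch.

```python
from collections import defaultdict


def assign_thread_groups(thread_groups, num_compute_units):
    """Place thread groups that read the same block on consecutive compute units.

    `thread_groups` maps a thread-group name to the label of the shared
    data block it reads (both are hypothetical labels).
    """
    # Detect sharing: collect the thread groups that read each block.
    by_block = defaultdict(list)
    for name, block in thread_groups.items():
        by_block[block].append(name)

    # Hand out compute units block by block, wrapping around if needed.
    assignment = {}
    next_cu = 0
    for block in sorted(by_block):
        for name in by_block[block]:
            assignment[name] = next_cu % num_compute_units
            next_cu += 1
    return assignment


thread_groups = {"tg0": "A0", "tg1": "A1", "tg2": "A0", "tg3": "A1"}
assignment = assign_thread_groups(thread_groups, num_compute_units=4)
print(assignment)
```

    Here `tg0` and `tg2` share block `A0` and land on compute units 0 and 1, while `tg1` and `tg3` share `A1` and land on units 2 and 3.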

    Method for matrix data broadcast in parallel processing

    Publication No.: US11983560B2

    Publication date: 2024-05-14

    Application No.: US17571374

    Filing date: 2022-01-07

    Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcast.

    Matrix data broadcast architecture

    Publication No.: US11609785B2

    Publication date: 2023-03-21

    Application No.: US16729811

    Filing date: 2019-12-30

    Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. The processor core executes a software application with matrix operations. The processor core supports the broadcast of shared data to multiple compute units of the processor core. A compiler or other code assigns thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, the processor core generates a single access request. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcast by the processor core.

    Method for matrix data broadcast in parallel processing

    Publication No.: US11275612B2

    Publication date: 2022-03-15

    Application No.: US16723016

    Filing date: 2019-12-20

    Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcast.

    METHOD FOR MATRIX DATA BROADCAST IN PARALLEL PROCESSING

    Publication No.: US20220129312A1

    Publication date: 2022-04-28

    Application No.: US17571374

    Filing date: 2022-01-07

    Abstract: Systems, apparatuses, and methods for efficient parallel execution of multiple work units in a processor by reducing a number of memory accesses are disclosed. A computing system includes a processor core with a parallel data architecture. One or more of a software application and firmware implement matrix operations and support the broadcast of shared data to multiple compute units of the processor core. The application creates thread groups by matching compute kernels of the application with data items, and grouping the resulting work units into thread groups. The application assigns the thread groups to compute units based on detecting shared data among the compute units. Rather than sending multiple read accesses to a memory subsystem for the shared data, a single access request is generated. The single access request includes information to identify the multiple compute units for receiving the shared data when broadcast.