Device and method for accelerating matrix multiply operations

    公开(公告)号:US11640444B2

    公开(公告)日:2023-05-02

    申请号:US17208526

    申请日:2021-03-22

    IPC分类号: G06F17/16 G06F7/53 G06F15/80

    摘要: A processing device is provided which comprises memory configured to store data and a plurality of processor cores in communication with each other via first and second hierarchical communication links. Processor cores of a first hierarchical processor core group are in communication with each other via the first hierarchical communication links and are configured to store, in the memory, a sub-portion of data of a first matrix and a sub-portion of data of a second matrix. The processor cores are also configured to determine a product of the sub-portion of data of the first matrix and the sub-portion of data of the second matrix, receive, from another processor core, another sub-portion of data of the second matrix and determine a product of the sub-portion of data of the first matrix and the other sub-portion of data of the second matrix.

    HARDWARE-SOFTWARE COLLABORATIVE ADDRESS MAPPING SCHEME FOR EFFICIENT PROCESSING-IN-MEMORY SYSTEMS

    公开(公告)号:US20220276795A1

    公开(公告)日:2022-09-01

    申请号:US17745278

    申请日:2022-05-16

    IPC分类号: G06F3/06 G06F12/02

    摘要: Approaches are provided for implementing hardware-software collaborative address mapping schemes that enable mapping data elements which are accessed together in the same row of one bank or over the same rows of different banks to achieve higher performance by reducing row conflicts. Using an intra-bank frame striping policy (IBFS), corresponding subsets of data elements are interleaved into a single row of a bank. Using an intra-channel frame striping policy (ICFS), corresponding subsets of data elements are interleaved into a single channel row of a channel. A memory controller utilizes ICFS and/or IBFS to efficiently store and access data elements in memory, such as processing-in-memory (PIM) enabled memory.

    Near-memory data reduction
    26.
    发明授权

    公开(公告)号:US11099788B2

    公开(公告)日:2021-08-24

    申请号:US16658733

    申请日:2019-10-21

    IPC分类号: G06F3/06 H03M7/30

    摘要: An approach is provided for implementing near-memory data reduction during store operations to off-chip or off-die memory. A Near-Memory Reduction (NMR) unit provides near-memory data reduction during write operations to a specified address range. The NMR unit is configured with a range of addresses to be reduced and when a store operation specifies an address within the range of addresses, the NRM unit performs data reduction by adding the data value specified by the store operation to an accumulated reduction result. According to an embodiment, the NRM unit maintains a count of the number of updates to the accumulated reduction result that are used to determine when data reduction has been completed.

    Hierarchical register file at a graphics processing unit

    公开(公告)号:US10853904B2

    公开(公告)日:2020-12-01

    申请号:US15079543

    申请日:2016-03-24

    摘要: A processor employs a hierarchical register file for a graphics processing unit (GPU). A top level of the hierarchical register file is stored at a local memory of the GPU (e.g., a memory on the same integrated circuit die as the GPU). Lower levels of the hierarchical register file are stored at a different, larger memory, such as a remote memory located on a different die than the GPU. A register file control module monitors the status of in-flight wavefronts at the GPU, and in particular whether each in-flight wavefront is active, predicted to be become active, or inactive. The register file control module places execution data for active and predicted-active wavefronts in the top level of the hierarchical register file and places execution data for inactive wavefronts at lower levels of the hierarchical register file.

    FAST THREAD WAKE-UP THROUGH EARLY LOCK RELEASE

    公开(公告)号:US20190317832A1

    公开(公告)日:2019-10-17

    申请号:US15952143

    申请日:2018-04-12

    IPC分类号: G06F9/52

    摘要: A thread holding a lock notifies a sleeping thread that is waiting on the lock that the lock holding thread is “about” to release the lock. In response to the notification, the waiting thread is woken up. While the waiting thread is woken up, the lock holding thread completes other operations prior to actually releasing the lock and then releases the lock. The notification to the waiting thread hides latency associated with waking up the waiting thread by allowing operations that wake up the waiting thread to occur while the lock holding thread is performing the other operations prior to releasing the thread.

    TECHNIQUES FOR IMPROVED LATENCY OF THREAD SYNCHRONIZATION MECHANISMS

    公开(公告)号:US20190317831A1

    公开(公告)日:2019-10-17

    申请号:US15952149

    申请日:2018-04-12

    IPC分类号: G06F9/52

    摘要: A memory fence or other similar operation is executed with reduced latency. An early fence operation is executed and acts as a hint to the processor executing the thread that executes the fence. This hint causes the processor to begin performing sub-operations for the fence earlier than if no such hint were executed. Examples of sub-operations for the fence include operations to make data written to by writes prior to the fence operation available to other threads. A resolving fence, which occurs after the early fence, performs the remaining sub-operations for the fence. By triggering some or all of the sub-operations for a memory fence that will occur in the future, the early fence operation reduces the amount of latency associated with that memory fence operation.