STOCHASTIC OPTIMIZATION OF SURFACE CACHEABILITY IN PARALLEL PROCESSING UNITS

    公开(公告)号:US20230195639A1

    公开(公告)日:2023-06-22

    申请号:US17557475

    申请日:2021-12-21

    CPC classification number: G06F12/0893 G06F2212/6042

    Abstract: A processing system selectively allocates storage at a local cache of a parallel processing unit for cache lines of a repeating pattern of data that exceeds the storage capacity of the cache. The processing system identifies repeating patterns of data having cache lines that have a reuse distance that exceeds the storage capacity of the cache. A cache controller allocates storage for only a subset of cache lines of the repeating pattern of data at the cache and excludes the remainder of cache lines of the repeating pattern of data from the cache. By restricting the cache to store only a subset of cache lines of the repeating pattern of data, the cache controller increases the hit rate at the cache for the subset of cache lines.

    VARIABLE DISPATCH WALK FOR SUCCESSIVE CACHE ACCESSES

    公开(公告)号:US20230195626A1

    公开(公告)日:2023-06-22

    申请号:US17558008

    申请日:2021-12-21

    CPC classification number: G06F12/0806 G06F12/10 G06F2212/1016

    Abstract: A processing system is configured to translate a first cache access pattern of a dispatch of work items to a cache access pattern that facilitates consumption of data stored at a cache of a parallel processing unit by a subsequent access before the data is evicted to a more remote level of the memory hierarchy. For consecutive cache accesses having read-after-read data locality, in some embodiments the processing system translates the first cache access pattern to a space-filling curve. In some embodiments, for consecutive accesses having read-after-write data locality, the processing system translates a first typewriter cache access pattern that proceeds in ascending order for a first access to a reverse typewriter cache access pattern that proceeds in descending order for a subsequent cache access. By translating the cache access pattern based on data locality, the processing system increases the hit rate of the cache.

    APPROACH FOR PERFORMING EFFICIENT MEMORY OPERATIONS USING NEAR-MEMORY COMPUTE ELEMENTS

    公开(公告)号:US20230195618A1

    公开(公告)日:2023-06-22

    申请号:US17557568

    申请日:2021-12-21

    CPC classification number: G06F12/06

    Abstract: Near-memory compute elements perform memory operations and temporarily store at least a portion of address information for the memory operations in local storage. A broadcast memory command is then issued to the near-memory compute elements that causes the near-memory compute elements to perform a subsequent memory operation using their respective address information stored in the local storage. This allows a single broadcast memory command to be used to perform memory operations across multiple memory elements, such as DRAM banks, using bank-specific address information. In one implementation, the approach is used to process workloads with irregular updates to memory while consuming less command bus bandwidth than conventional approaches. Implementations include using conditional flags to selectively designate address information in local storage that is to be processed with the broadcast memory command.

    HARDWARE ACCELERATED DYNAMIC WORK CREATION ON A GRAPHICS PROCESSING UNIT

    公开(公告)号:US20230185607A1

    公开(公告)日:2023-06-15

    申请号:US17993490

    申请日:2022-11-23

    CPC classification number: G06F9/4881 G06F9/542 G06F9/545 G06F9/546 G06F9/3877

    Abstract: A processor core is configured to execute a parent task that is described by a data structure stored in a memory. A coprocessor is configured to dispatch a child task to the at least one processor core in response to the coprocessor receiving a request from the parent task concurrently with the parent task executing on the at least one processor core. In some cases, the parent task registers the child task in a task pool and the child task is a future task that is configured to monitor a completion object and enqueue another task associated with the future task in response to detecting the completion object. The future task is configured to self-enqueue by adding a continuation future task to a continuation queue for subsequent execution in response to the future task failing to detect the completion object.

    Dual vector arithmetic logic unit
    289.
    发明授权

    公开(公告)号:US11675568B2

    公开(公告)日:2023-06-13

    申请号:US17121354

    申请日:2020-12-14

    CPC classification number: G06F7/57 G06F9/3867 G06F17/16 G06T1/20 G06F15/8015

    Abstract: A processing system executes wavefronts at multiple arithmetic logic unit (ALU) pipelines of a single instruction multiple data (SIMD) unit in a single execution cycle. The ALU pipelines each include a number of ALUs that execute instructions on wavefront operands that are collected from vector general process register (VGPR) banks at a cache and output results of the instructions executed on the wavefronts at a buffer. By storing wavefronts supplied by the VGPR banks at the cache, a greater number of wavefronts can be made available to the SIMD unit without increasing the VGPR bandwidth, enabling multiple ALU pipelines to execute instructions during a single execution cycle.

    READ CLOCK START AND STOP FOR SYNCHRONOUS MEMORIES

    公开(公告)号:US20230176608A1

    公开(公告)日:2023-06-08

    申请号:US17850299

    申请日:2022-06-27

    CPC classification number: G06F1/08 G06F1/10

    Abstract: A memory includes a read clock state machine and a read clock driver circuit. The read clock state machine has a first input for receiving a read command signal, a second input for receiving a read clock mode signal, and an output for providing a drive enable signal. The read clock driver circuit has an output for providing a read clock signal in response to a clock signal when the drive enable signal is active. When the read clock mode signal indicates a read-only mode, the read clock state machine starts toggling the read clock signal during a read preamble period before a data transmission of a first read command, and continues toggling the read clock signal for at least a read postamble period following the data transmission of the first read command.

Patent Agency Ranking