APPROACH FOR ENFORCING ORDERING BETWEEN MEMORY-CENTRIC AND CORE-CENTRIC MEMORY OPERATIONS

    Publication number: US20220317926A1

    Publication date: 2022-10-06

    Application number: US17219446

    Application date: 2021-03-31

    Abstract: Ordering between memory-centric memory operations, referred to hereinafter as “MC-Mem-Ops,” and core-centric memory operations, referred to hereinafter as “CC-Mem-Ops,” is enforced using inter-centric fences, referred to hereinafter as “IC-fences.” IC-fences are implemented by an ordering primitive or ordering instruction that causes a memory controller, a cache controller, etc., to enforce ordering of MC-Mem-Ops and CC-Mem-Ops throughout the memory pipeline and at the memory controller by not reordering MC-Mem-Ops (or sometimes CC-Mem-Ops) that arrive before the IC-fence to after the IC-fence. Processing of an IC-fence also causes the memory controller to issue an ordering acknowledgment to the thread that issued the IC-fence instruction. IC-fences are tracked at the core and designated as complete when the ordering acknowledgment is received. Embodiments include a completion level-specific cache flush operation which, when used with an IC-fence, provides proper ordering between cached CC-Mem-Ops and MC-Mem-Ops with reduced data transfer and completion times.
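The fence behavior described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented mechanism: the request tuples, the `schedule` function, and the sort-by-address reordering policy are all hypothetical stand-ins. The model lets the controller reorder requests freely within a window, never moves a request across a fence, and records one acknowledgment per fence for the issuing thread.

```python
def schedule(requests):
    """Toy memory-controller model (hypothetical).

    requests: list of ("MC", addr), ("CC", addr) or ("FENCE", thread_id).
    Returns (issued, acks): the issue order of memory operations and the
    thread IDs that received an ordering acknowledgment.
    """
    issued, acks, window = [], [], []
    for kind, arg in requests:
        if kind == "FENCE":
            # Drain everything that arrived before the fence. Sorting by
            # address is an assumed reordering policy, used only to show
            # that reordering stays inside the window.
            issued.extend(sorted(window, key=lambda r: r[1]))
            window = []
            acks.append(arg)  # acknowledge the fence to its thread
        else:
            window.append((kind, arg))
    # Requests after the last fence form the final window.
    issued.extend(sorted(window, key=lambda r: r[1]))
    return issued, acks


reqs = [("MC", 8), ("CC", 2), ("FENCE", 0), ("CC", 1), ("MC", 4)]
issued, acks = schedule(reqs)
print(issued)
print(acks)
```

In this model the operations on addresses 8 and 2 may swap with each other, but both are always issued before the operations on addresses 1 and 4.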

    Data placement with packet metadata

    Publication number: US12182428B2

    Publication date: 2024-12-31

    Application number: US17124872

    Application date: 2020-12-17

    Abstract: Systems, apparatuses, and methods for determining data placement based on packet metadata are disclosed. A system includes a traffic analyzer that determines data placement across connected devices based on observed values of the metadata fields in actively exchanged packets across a plurality of protocol types. In one implementation, the protocol that is supported by the system is the compute express link (CXL) protocol. The traffic analyzer performs various actions in response to events observed in a packet stream that match items from a pre-configured list. Data movement is handled underneath the software applications by changing the virtual-to-physical address translation once the data movement is completed. After the data movement is finished, threads will pull in the new host physical address into their translation lookaside buffers (TLBs) via a page table walker or via an address translation service (ATS) request.

    Dynamic multi-bank memory command coalescing

    Publication number: US11681465B2

    Publication date: 2023-06-20

    Application number: US16900526

    Application date: 2020-06-12

    CPC classification number: G06F3/0659 G06F3/0604 G06F3/0644 G06F3/0673

    Abstract: Systems, apparatuses, and methods for dynamically coalescing multi-bank memory commands to improve command throughput are disclosed. A system includes a processor coupled to a memory via a memory controller. The memory also includes processing-in-memory (PIM) elements which are able to perform computations within the memory. The processor generates memory requests targeting the memory which are sent to the memory controller. The memory controller stores commands received from the processor in a queue, and the memory controller determines whether opportunities exist for coalescing multiple commands together into a single multi-bank command. After coalescing multiple commands into a single combined multi-bank command, the memory controller conveys, across the memory bus to multiple separate banks, the single multi-bank command and a multi-bank code specifying which banks are targeted. The memory banks process the command in parallel, and the PIM elements process the data next to each respective bank.
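The coalescing step in this abstract can be sketched as follows. This is a minimal illustration, not the patented design: the `Command` fields, the rule that commands are coalescible when they match in operation, row, and column, and the bitmask encoding of the multi-bank code are all assumptions. The sketch also assumes the queued commands are independent, so grouping them does not violate any ordering requirement.

```python
from collections import namedtuple

# Hypothetical command layout: operation name plus bank/row/column target.
Command = namedtuple("Command", ["op", "bank", "row", "col"])


def coalesce(queue):
    """Merge commands that differ only in their bank into one multi-bank
    command. Returns tuples (op, row, col, bank_mask), where bit i of
    bank_mask is set when bank i is targeted (assumed encoding)."""
    groups = {}  # (op, row, col) -> bank mask; dicts keep insertion order
    for cmd in queue:
        key = (cmd.op, cmd.row, cmd.col)
        groups[key] = groups.get(key, 0) | (1 << cmd.bank)
    return [(op, row, col, mask) for (op, row, col), mask in groups.items()]


queue = [
    Command("PIM_ADD", bank=0, row=5, col=3),
    Command("PIM_ADD", bank=2, row=5, col=3),
    Command("READ", bank=1, row=7, col=0),
]
for merged in coalesce(queue):
    print(merged)
```

Here the two `PIM_ADD` commands collapse into a single command with bank mask `0b101`, so one transfer over the command bus reaches banks 0 and 2.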

    Detecting execution hazards in offloaded operations

    Publication number: US11188406B1

    Publication date: 2021-11-30

    Application number: US17218506

    Application date: 2021-03-31

    Abstract: Detecting execution hazards in offloaded operations is disclosed. A second offload operation is compared to a first offload operation that precedes the second offload operation. It is determined whether the second offload operation creates an execution hazard on an offload target device based on the comparison of the second offload operation to the first offload operation. If the execution hazard is detected, an error handling operation may be performed. In some examples, the offload operations are processing-in-memory operations.

    MEMORY REQUEST PRIORITY ASSIGNMENT TECHNIQUES FOR PARALLEL PROCESSORS

    Publication number: US20210173796A1

    Publication date: 2021-06-10

    Application number: US16706421

    Application date: 2019-12-06

    Abstract: Systems, apparatuses, and methods for implementing memory request priority assignment techniques for parallel processors are disclosed. A system includes at least a parallel processor coupled to a memory subsystem, where the parallel processor includes at least a plurality of compute units for executing wavefronts in lock-step. The parallel processor assigns priorities to memory requests of wavefronts on a per-work-item basis by indexing into a first priority vector, with the index generated based on lane-specific information. If a given event is detected, a second priority vector is generated by applying a given priority promotion vector to the first priority vector. Then, for subsequent wavefronts, memory requests are assigned priorities by indexing into the second priority vector with lane-specific information. The use of priority vectors to assign priorities to memory requests helps to reduce the memory divergence problem experienced by different work-items of a wavefront.
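The priority-vector lookup in this abstract can be sketched in a few lines. The abstract does not say how the index is derived from lane-specific information or how the promotion vector is applied, so both are assumptions here: the index is the lane ID modulo the vector length, and promotion is a saturating element-wise addition. The function names and the `max_priority` parameter are hypothetical.

```python
def assign_priorities(priority_vector, lane_ids):
    """Look up a priority for each work-item by indexing the vector with
    its lane ID (assumed index function: lane ID modulo vector length)."""
    n = len(priority_vector)
    return [priority_vector[lane % n] for lane in lane_ids]


def promote(priority_vector, promotion_vector, max_priority=3):
    """Build the second priority vector from the first (assumed rule:
    element-wise addition, saturating at max_priority)."""
    return [min(p + d, max_priority)
            for p, d in zip(priority_vector, promotion_vector)]


first = [3, 2, 1, 0]
print(assign_priorities(first, range(8)))

# After the triggering event, later wavefronts use the promoted vector.
second = promote(first, [0, 1, 1, 1])
print(assign_priorities(second, range(4)))
```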

    Approach for performing efficient memory operations using near-memory compute elements

    Publication number: US12235756B2

    Publication date: 2025-02-25

    Application number: US17557568

    Application date: 2021-12-21

    Abstract: Near-memory compute elements perform memory operations and temporarily store at least a portion of address information for the memory operations in local storage. A broadcast memory command is then issued to the near-memory compute elements that causes the near-memory compute elements to perform a subsequent memory operation using their respective address information stored in the local storage. This allows a single broadcast memory command to be used to perform memory operations across multiple memory elements, such as DRAM banks, using bank-specific address information. In one implementation, the approach is used to process workloads with irregular updates to memory while consuming less command bus bandwidth than conventional approaches. Implementations include using conditional flags to selectively designate address information in local storage that is to be processed with the broadcast memory command.
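A small software model may help make the broadcast idea concrete. It is a sketch under stated assumptions rather than the patented hardware: the `Bank` class, its `read` method, the `broadcast_add` helper, and the choice of an add as the follow-up operation are all hypothetical. Each bank remembers the address of its most recent operation in local storage, and a single broadcast command then updates every flagged bank at its own saved address.

```python
class Bank:
    """Toy DRAM bank with a near-memory compute element (hypothetical)."""

    def __init__(self, size):
        self.cells = [0] * size
        self.saved_addr = None  # local storage for address information
        self.flag = False       # conditional flag: take part in broadcast?

    def read(self, addr, flag=True):
        # A per-bank operation that also records its address locally.
        self.saved_addr = addr
        self.flag = flag
        return self.cells[addr]


def broadcast_add(banks, value):
    """One broadcast command: each flagged bank applies the operation at
    the bank-specific address held in its local storage."""
    for bank in banks:
        if bank.flag and bank.saved_addr is not None:
            bank.cells[bank.saved_addr] += value


banks = [Bank(8) for _ in range(3)]
banks[0].read(1)
banks[1].read(6)
banks[2].read(4, flag=False)  # address stored but not selected
broadcast_add(banks, 5)
print([bank.cells for bank in banks])
```

The three per-bank updates would otherwise need three commands carrying three different addresses; in this model they are triggered by one command that carries none.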

    COMMUNICATION REDUCTION TECHNIQUES FOR PARALLEL COMPUTING

    Publication number: US20240119198A1

    Publication date: 2024-04-11

    Application number: US17958058

    Application date: 2022-09-30

    CPC classification number: G06F30/23 G06F30/27 G06F2119/02

    Abstract: A physical system is simulated using a model including a plurality of elements in a mesh or grid. The elements are divided into partitions processed by different processing units. For some time steps, state data is transmitted between partitions and used to calculate flux data for updating the state of edge elements of the partitions. Periodically, transmission of state data is suppressed, and flux data is obtained by linear interpolation based on past flux data. Alternatively, flux data is obtained by processing state variables of an edge element and past flux data using a machine learning model, such as a DNN. Whether to suppress transmission of state data may be determined based on one or both of (a) uncertainty in an output of the machine learning model (e.g., Bayesian neural network) and (b) complexity of the model of the physical system (e.g., spatial or temporal gradients).
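The linear estimate used when state transmission is suppressed can be sketched as below. The abstract does not say how many past samples are used or how they are stored, so this sketch assumes the two most recent `(time, flux)` pairs and a hypothetical `estimate_flux` helper.

```python
def estimate_flux(history, t):
    """Estimate the flux at time t from past flux data by fitting a line
    through the two most recent (time, flux) samples (assumed scheme)."""
    (t0, f0), (t1, f1) = history[-2], history[-1]
    slope = (f1 - f0) / (t1 - t0)
    return f1 + slope * (t - t1)


# Flux values computed on steps where state data was actually exchanged.
history = [(0.0, 1.0), (1.0, 3.0)]

# On a suppressed step, no state data arrives; estimate the flux instead.
print(estimate_flux(history, 2.0))
```

The trade-off is that each suppressed step saves one exchange of state data between partitions at the cost of an approximation error that grows when the flux changes nonlinearly.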
