COLLECTIVE OPERATION USING A NETWORK-ATTACHED MEMORY

    Publication No.: US20250021273A1

    Publication date: 2025-01-16

    Application No.: US18349318

    Filing date: 2023-07-10

    Abstract: In some examples, a processor receives a first request to allocate a memory region for a collective operation by process entities in a plurality of computer nodes. In response to the first request, the processor creates a virtual address for the memory region and allocates the memory region in a network-attached memory coupled to the plurality of computer nodes over a network. The processor correlates the virtual address to an address of the memory region in mapping information. The processor identifies the memory region in the network-attached memory by obtaining the address of the memory region from the mapping information using the virtual address in a second request. In response to the second request, the processor performs the collective operation.
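The two-request flow in this abstract can be illustrated with a minimal sketch. It is not the patented implementation: the class names, the bump allocator, the virtual-address numbering, and the choice of a sum reduction as the collective operation are all assumptions made for illustration, and a bytearray stands in for the network-attached memory.

```python
from itertools import count


class NetworkAttachedMemory:
    """Hypothetical stand-in for network-attached memory: a flat byte array."""

    def __init__(self, size):
        self.data = bytearray(size)
        self.next_free = 0

    def allocate(self, nbytes):
        # Simple bump allocation; returns the address of the new region.
        if self.next_free + nbytes > len(self.data):
            raise MemoryError("network-attached memory exhausted")
        address = self.next_free
        self.next_free += nbytes
        return address


class CollectiveManager:
    """Hypothetical processor-side logic handling the two requests."""

    def __init__(self, memory):
        self.memory = memory
        self.mapping = {}  # mapping information: virtual address -> (address, length)
        self._virtual = count(start=0x1000, step=0x1000)

    def allocate_region(self, nbytes):
        # First request: create a virtual address, allocate the region,
        # and correlate the two in the mapping information.
        virtual = next(self._virtual)
        address = self.memory.allocate(nbytes)
        self.mapping[virtual] = (address, nbytes)
        return virtual

    def reduce_sum(self, virtual, contributions):
        # Second request: look up the region by virtual address, then perform
        # the collective operation (here, an assumed sum reduction).
        address, nbytes = self.mapping[virtual]
        total = sum(contributions)
        self.memory.data[address:address + nbytes] = total.to_bytes(nbytes, "little")
        return total

    def read(self, virtual):
        address, nbytes = self.mapping[virtual]
        return int.from_bytes(self.memory.data[address:address + nbytes], "little")


manager = CollectiveManager(NetworkAttachedMemory(64))
region = manager.allocate_region(8)
result = manager.reduce_sum(region, [1, 2, 3])
```

Process entities only ever hold the virtual address; the physical placement of the region stays inside the mapping information.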

    OPERATIONS ON DATA FOR COMMANDS IN INTERACTIVE PROGRAMMING SESSIONS

    Publication No.: US20240406251A1

    Publication date: 2024-12-05

    Application No.: US18494960

    Filing date: 2023-10-26

    Abstract: In some examples, a system having a plurality of computer nodes receives a command based on program code of a program being developed in an interactive programming session. The system distributes data items from a network-attached memory to a distributed data object having data in node memories of the plurality of computer nodes. A dataset manager performs an operation specified by the command on the distributed data object, the operation executed in parallel on the plurality of computer nodes. The dataset manager produces derived data generated by the operation on the distributed data object, the derived data accessible by a programmer in the interactive programming session.
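As a rough sketch of the distribute-then-operate pattern described in this abstract, the following uses threads in place of computer nodes and Python lists in place of node memories. The function names and the round-robin partitioning are assumptions; the abstract does not specify how data items are split.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(items, num_nodes):
    """Split data items round-robin into one partition per (simulated) node."""
    return [items[i::num_nodes] for i in range(num_nodes)]


def run_command(partitions, operation):
    """Apply the operation to every partition in parallel and return derived data."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: [operation(x) for x in part], partitions))


partitions = distribute(list(range(6)), 3)
derived = run_command(partitions, lambda x: x * x)
```

The derived data keeps the same partition layout as the distributed data object, so a later command in the session could operate on it without redistributing.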

    Programming model and framework for providing resilient parallel tasks

    Publication No.: US10942824B2

    Publication date: 2021-03-09

    Application No.: US16153833

    Filing date: 2018-10-08

    Abstract: Exemplary embodiments herein describe programming models and frameworks for providing parallel and resilient tasks. Tasks are created in accordance with predetermined structures. Defined tasks are stored as data objects in a shared pool of memory that is made up of disaggregated memory communicatively coupled via a high-performance interconnect that supports atomic operations as described herein. Heterogeneous compute nodes are configured to execute tasks stored in the shared memory. When compute nodes fail, they do not impact the shared memory, the tasks or other data stored in the shared memory, or the other non-failing compute nodes. The non-failing compute nodes can take on the responsibility of executing tasks owned by other compute nodes, including tasks of a compute node that fails, without needing a centralized manager or scheduler to re-assign those tasks. Task processing can therefore be performed in parallel and without impact from node failures.
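The takeover behavior in this abstract can be sketched as follows. This is an illustrative model only: a lock emulates the interconnect's atomic compare-and-swap, a dictionary stands in for the shared memory pool, and the way a node learns which peers have failed (the `failed_nodes` argument) is an assumption, since the abstract does not describe failure detection.

```python
import threading


class SharedTaskPool:
    """Hypothetical shared pool holding task ownership records."""

    def __init__(self, tasks):
        self._lock = threading.Lock()  # emulates hardware atomic operations
        self.owner = {task: None for task in tasks}
        self.done = set()

    def compare_and_swap_owner(self, task, expected, new):
        # Atomically take ownership only if the current owner matches `expected`
        # and the task has not already completed.
        with self._lock:
            if self.owner[task] == expected and task not in self.done:
                self.owner[task] = new
                return True
            return False

    def complete(self, task, node):
        with self._lock:
            if self.owner[task] == node:
                self.done.add(task)


def run_node(pool, node, failed_nodes):
    """Claim unowned tasks, plus tasks owned by nodes assumed to have failed."""
    executed = []
    for task in list(pool.owner):
        for expected in (None, *failed_nodes):
            if pool.compare_and_swap_owner(task, expected, node):
                executed.append(task)  # the task body would run here
                pool.complete(task, node)
                break
    return executed


pool = SharedTaskPool(["t1", "t2", "t3"])
pool.compare_and_swap_owner("t1", None, "node-a")  # node-a claims t1, then fails
executed = run_node(pool, "node-b", failed_nodes=["node-a"])
```

No central component reassigns `t1`; the surviving node takes it over with the same atomic primitive it uses to claim fresh tasks.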

    SYSTEM AND METHOD FOR FACILITATING EFFICIENT MANAGEMENT OF DATA STRUCTURES STORED IN REMOTE MEMORY

    Publication No.: US20230019758A1

    Publication date: 2023-01-19

    Application No.: US17377777

    Filing date: 2021-07-16

    Abstract: A system and method are provided for facilitating efficient management of data structures stored in remote memory. During operation, the system receives a request to allocate memory for a first part in a data structure stored in a remote memory associated with a compute node in a network. The system pre-allocates a buffer in the remote memory for a plurality of parts in the data structure and stores a first local descriptor associated with the buffer in a local worker table stored in a volatile memory of the compute node. The first local descriptor facilitates servicing future access requests to the first and other parts in the data structure. The system stores a first global descriptor for the buffer in a shared global table stored in the remote memory and generates a first reference corresponding to the first part, thereby facilitating faster traversals of the data structure.
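A minimal sketch of the pre-allocation scheme in this abstract is shown below. All names (`Descriptor`, `Reference`, `Worker`, `parts_per_buffer`) and the fixed-size part layout are assumptions for illustration; the abstract does not specify descriptor contents or buffer sizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    buffer_id: int
    base: int       # start address of the buffer in remote memory
    part_size: int
    capacity: int   # number of parts the buffer can hold


@dataclass(frozen=True)
class Reference:
    buffer_id: int
    index: int      # position of the part inside its buffer


class RemoteMemory:
    """Hypothetical remote memory holding the shared global table."""

    def __init__(self):
        self.global_table = {}
        self.next_free = 0

    def reserve(self, nbytes):
        base = self.next_free
        self.next_free += nbytes
        return base


class Worker:
    """Hypothetical compute-node side with a local worker table."""

    def __init__(self, remote, parts_per_buffer=4):
        self.remote = remote
        self.parts_per_buffer = parts_per_buffer
        self.local_table = {}   # kept in the node's volatile memory
        self.used = {}          # buffer_id -> parts already handed out
        self.remote_allocations = 0

    def allocate_part(self, part_size):
        # Serve the request from an existing buffer when one has room.
        for buffer_id, desc in self.local_table.items():
            if desc.part_size == part_size and self.used[buffer_id] < desc.capacity:
                index = self.used[buffer_id]
                self.used[buffer_id] += 1
                return Reference(buffer_id, index)
        # Otherwise pre-allocate a buffer for several parts at once.
        base = self.remote.reserve(part_size * self.parts_per_buffer)
        self.remote_allocations += 1
        buffer_id = len(self.remote.global_table)
        desc = Descriptor(buffer_id, base, part_size, self.parts_per_buffer)
        self.local_table[buffer_id] = desc           # local descriptor
        self.remote.global_table[buffer_id] = desc   # global descriptor
        self.used[buffer_id] = 1
        return Reference(buffer_id, 0)

    def address_of(self, ref):
        # Resolve a reference using only the local descriptor.
        desc = self.local_table[ref.buffer_id]
        return desc.base + ref.index * desc.part_size


remote = RemoteMemory()
worker = Worker(remote, parts_per_buffer=4)
refs = [worker.allocate_part(64) for _ in range(5)]
```

Five part allocations here cost only two remote reservations, and resolving a reference never touches the shared global table, which is the kind of saving the abstract attributes to the local descriptor.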

    SYSTEMS AND METHODS FOR AGGREGATE BANDWIDTH AND LATENCY OPTIMIZATION

    Publication No.: US20190334771A1

    Publication date: 2019-10-31

    Application No.: US15967583

    Filing date: 2018-04-30

    Abstract: Systems and methods for dynamically and programmatically controlling hardware and software to optimize bandwidth and latency across partitions in a computing system are discussed herein. In various embodiments, performance within a partitioned computing system may be monitored and used to automatically reconfigure the computing system to optimize aggregate bandwidth and latency. Reconfiguring the computing system may comprise reallocating hardware resources among partitions, programming firewalls to enable higher bandwidth for specific inter-partition traffic, switching programming models associated with individual partitions, starting additional instances of one or more applications running on the partitions, and/or one or more other operations to optimize the overall aggregate bandwidth and latency of the system.
