DYNAMIC QUANTIZATION AND MEMORY MANAGEMENT OF KEY-VALUE CACHE FOR SERVING LARGE LANGUAGE MODELS

    公开(公告)号:US20250061316A1

    公开(公告)日:2025-02-20

    申请号:US18934700

    申请日:2024-11-01

    Abstract: Key-value (KV) cache paging schemes can improve memory management for KV caches by storing a KV cache page having key tensors and value tensors for a fixed number of tokens in a fixed-sized block in the KV cache of a worker. To further improve memory management, the schemes can be modified to implement dynamic variable quantization. Quantization level of a KV cache page can be set based on a runtime importance score of the KV cache page. In addition, the quantization level of the KV cache page can be set based on the system load. The end result is a scheme that can achieve a high compression ratio of KV cache pages in the KV cache. Fitting more KV cache pages in the KV cache can lead to higher inference throughput, higher system-level user capacity, and higher end-to-end service availability.

    METHODOLOGY TO ENABLE HIGHLY RESPONSIVE GAMEPLAY IN CLOUD AND CLIENT GAMING

    公开(公告)号:US20240307773A1

    公开(公告)日:2024-09-19

    申请号:US18478201

    申请日:2023-09-29

    CPC classification number: A63F13/52 A63F2300/66

    Abstract: Described herein is a technique to enhance the responsiveness of gameplay for a 3D gaming application while maintaining the ability to enqueue multiple frames for processing on the GPU. Each frame or a set of workloads within a frame is submitted to the GPU with predication, such that the indicated rendering and resource manipulation commands are not actually performed if the predication condition is enabled. A low latency command can be submitted to the GPU via a copy engine command queue. The command will cause the copy engine to enable or disable predication for command buffers in the command queue. When predication for queued command buffers is enabled, command buffers for workloads that are not related to the workload that is generated in response to the user input are bypassed. High priority command buffers that include workloads generated in response to user input can then be executed immediately.

    SEMANTIC-GUIDED TRANSFORMER FOR OBJECT RECOGNITION AND RADIANCE FIELD-BASED NOVEL VIEW

    公开(公告)号:US20240029455A1

    公开(公告)日:2024-01-25

    申请号:US18475353

    申请日:2023-09-27

    CPC classification number: G06V20/64 G06V20/70 G06T15/20 G06V10/56 G06V10/774

    Abstract: Systems, apparatuses and methods may provide for technology that encodes multi-view visual data into latent features via an aggregator encoder, decodes the latent features into one or more novel target views different from views of the multi-view visual data via a rendering decoder, and decodes the latent features into an object label via a label decoder. The operation to decode the latent features via the rendering decoder and to decode the latent features via the label decoder occur at least partially at the same time. The operation to encode, via the aggregator encoder, the multi-view visual data into the latent features further includes operations to: perform, via the aggregator encoder, semantic object recognition operations based on radiance field view synthesis operations, and perform, via the aggregator encoder, radiance field view synthesis operations based on semantic object recognition operations.

Patent Agency Ranking